METHOD AND SYSTEM
FOR NETWORK INFORMATION ACCESS

1. FIELD OF THE INVENTION 
The field of this invention relates to information access over
networks, and specifically to the automatic location and
evaluation of relevant information available over public or
private networks from information sources in response to user
queries.

2. BACKGROUND 

The exponential growth of private intranets and the public
Internet has produced a daunting labyrinth of increasingly
numerous documents, databases and utilities. Almost any type of
information is now available somewhere, but most users cannot
find what they seek, and even expert users waste copious time and
effort searching for appropriate information sources. One problem
is simply the increasingly large number of available information
sources, which is beyond the comprehension of a single user. A
second problem, along with this growth in available information
and information sources, is a commensurate growth in software
utilities and methods to manage, access, and present this
information. Each utility has a different and often unique
interface and set of commands and capabilities, and is
appropriate for a different set of users and a different set of
information types and sources. Thus the sheer diversity of
available utilities creates a problem for users comparable to
that created by the information explosion. Users are now faced
with the twin problems of which tool to use to inquire at which
information source.

In the past, efforts have been made to provide users with
automatic, computer assisted services that can help solve these
twin problems of the network revolution. For example, AI
researchers have created several prototype software agents that
help users with e-mail and netnews filtering (Pattie Maes et al.,
1993, Learning interface agents, Proceedings of AAAI-93), agents
that assist with world wide web browsing (H. Lieberman, 1995,
Letizia: An agent that assists web browsing, Proc. 15th Int.
Joint Conf. on A.I., pp. [illegible]; Robert Armstrong et al.,
1995, WebWatcher: A learning apprentice for the world wide web,
Working Notes of the AAAI Spring Symposium: Information Gathering
from Heterogeneous, Distributed Environments, pp. 6-12, Stanford
University, AAAI Press), agents that schedule meetings (Lisa Dent
et al., 1992, A personal learning apprentice, Proc. 10th Nat.
Conf. on A.I., pp. 96-103; Pattie Maes, 1994, Agents that reduce
work and information overload, Comm. of the ACM 37(7):31-40, 146;
Tom Mitchell et al., 1994, Experience with a learning personal
assistant, Comm. of the ACM 37(7):[illegible]), and agents that
perform internet-related tasks (O. Etzioni et al., 1994, A
softbot-based interface to the internet, CACM 37(7):72-75).

Increasingly, the information such agents need to access is
available on the World Wide Web. Unfortunately, even a domain as
standardized as the WWW has turned out to pose significant
problems for automatic software agents. For one, although web
pages are universally written in Hypertext Markup Language
("HTML"), this language merely defines the format of information
display, making no attempt to hint at its meaning or semantic
content. Currently, no "semantic markup language" for the web
exists, nor is one likely to be adopted universally. The Internet
can be expected to pose even greater problems.

Thus, the advent of intranets, the Internet, and the World Wide
Web has posed several fundamental problems for the automatic
services or agents designed to assist users in finding relevant
information. First, no one such service has heretofore provided
sufficient additional value to replace the use of a Web browser
having access to existing directories and indices such as Yahoo
or Lycos. Second, existing services have not yet been able to
understand and parse relevant information from the responses
returned from a wide variety of Internet and web information
sources. Third, existing services and agents have not been easy
to adapt to the ever-increasing numbers of sources with their
ever-changing response formats. This is due to the
individualized, hand-coded interface to each Internet service and
Web site utilized by existing agents (Yigal Arens et al., 1993,
Retrieving and integrating data from multiple information
sources, International Journal on Intelligent and Cooperative
Information Systems 2(2):127-158; O. Etzioni et al., 1994, A
softbot-based interface to the internet, CACM 37(7):72-75; B.
Krulwich, 1995, Bargain finder agent prototype, Technical report,
Andersen Consulting; Alon Y. Levy et al., 1995, Data model and
query evaluation in global information systems, Journal of
Intelligent Information Systems, Special Issue on Networked
Information Discovery and Retrieval 5(2); Mike Perkowitz et al.,
1995, Category translation: Learning to understand information on
the internet, Proc. 15th Int. Joint Conf. on A.I.). Preferably, a
service or agent should be able to access a new or changed
Internet information source in order to automatically learn how
to retrieve relevant information from the source. This would be
advantageous even if such a facility were limited to groups of
sources with response formats selected according to certain
constraining principles.

3. SUMMARY OF THE INVENTION

It is a broad object of this invention to solve these fundamental
problems by a method and system that provide a personalized
network robot, called a "netbot." A netbot acts as a user's
intelligent assistant by tracking available network information
sources, knowing the relevant information and features of each
particular source, and upon user request determining which
sources are relevant to a given query, forwarding the query to
the most relevant information sources, understanding the
responses returned from each source, and integrating and
intelligently presenting the query results to the user.

The netbots of this invention possess several advantages,
including the following. First, a netbot returns only the most
relevant information to the user. On the one hand, each user
query is forwarded only to the primary information sources
determined to be most relevant. On the other hand, information
source responses are parsed and understood so that only the
relevant data items are extracted for user presentation.
Duplicate, stale, and irrelevant information items are discarded.
Second, a netbot is fast. Since it automatically searches the
relevant primary sources in parallel, it can present information
as quickly as the fastest primary source returns a response.
Despite changing conditions which cause different information
sources to fluctuate in speed, a netbot integrator remains as
fast as the fastest source. Sources that have no information to
return to a query do not slow the user since the netbot ignores
them. Third, netbots are easily adapted to the ever-increasing
number of network information sources with ever-changing response
formats. Netbots utilize a novel declarative language for
describing information sources. A source description is short and
easily understandable, and therefore is easily written and
maintained.

Therefore, in one aspect the invention includes a method for
efficient access to information sources on a network comprising
preferably one or more of the following steps: receiving a user
query for information; determining the information sources most
relevant to this query; retrieving a description of each
information source; formatting the query according to this
description in a manner suitable for each information source and
transmitting the formatted query to the source; receiving
responses from the information sources; for each source,
understanding and extracting the relevant data fields according
to the retrieved description; and presenting to the user the
relevant data from each information source in an intelligent
manner ranked by an estimate of its relevance. Advantageously,
these steps are performed in parallel to the greatest extent
possible. In particular, at least, all queries are transmitted to
all relevant information sources in parallel without waiting for
intervening responses.
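The parallel transmission described above can be sketched as
follows. This is a minimal illustration in Python, not the
disclosed implementation: the function name
query_sources_in_parallel and the representation of each source
as a plain callable are assumptions made for the example, and
threads stand in for whatever concurrency mechanism an actual
netbot uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_sources_in_parallel(query, sources):
    """Send `query` to every source at once and yield answers as they arrive.

    `sources` maps a source name to a callable taking the query and
    returning a list of items (hypothetical stand-ins for network requests).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # Submit every request before waiting on any response.
        futures = {pool.submit(fetch, query): name
                   for name, fetch in sources.items()}
        for future in as_completed(futures):
            try:
                items = future.result()
            except Exception:
                # Assumption: a failing source is treated like an empty one.
                continue
            if items:
                # Sources with nothing to report are skipped.
                yield futures[future], items
```

Because results are yielded in completion order, the first
answer is available as soon as the fastest source responds, and
a source that returns nothing never reaches the caller.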

In another aspect the invention comprises a computer system and
apparatus for performing one or more steps of the method of this
invention. The user has a presentation device attached to a
network to which is also attached a plurality of information
sources. The presentation device receives user queries and
displays netbot responses. Further, the presentation device
performs one or more of the steps of the method of this
invention. One or more of those steps not performed on this
device can advantageously be performed on network attached netbot
server computers, which respond to functional requests from the
user device. Optionally, the user device can range from a
diskless hand-held terminal, to a PC, to a workstation, and so
forth.

In a further aspect the invention comprises a new language and
language implementation to facilitate the creation and
maintenance of descriptions of information sources. Importantly,
this language recognizes relevant data fields in responses
returned from information sources and is capable of extracting
all such fields. In the preferred embodiment this language has an
action statement component and a regular expression component.
The regular expression component has novel features for creating
modular hierarchical descriptions of regular expressions, for
binding variables to the correct sub-strings recognized during
pattern match to a response of an information source, for
performing arbitrary action language statements with multiple
variable bindings, and for specifying backtrack free recognition
of sub-strings where possible.

4. BRIEF DESCRIPTION OF THE DRAWINGS

These and other features, aspects, and advantages of the present
invention will become better understood by reference to the
accompanying drawings, following description, and appended
claims, where:

Fig. 1 illustrates generally a netbot of this invention;

Figs. 2A-B illustrate exemplary user interfaces of embodiments of
the netbot of Fig. 1;

Fig. 3 illustrates exemplary functional components of the netbot
of Fig. 1; and

Fig. 4 illustrates alternative hardware embodiments of the netbot
of Fig. 1.

5. DETAILED DESCRIPTION

For clarity of disclosure, and not by way of limitation, the
detailed description of a netbot of this invention is presented
as a method or process for network information access, as a
system or apparatus implemented to perform that method, and as a
novel language designed to assist in implementing this method and
system. First, an overview of these aspects of the invention is
presented, followed, second, by a detailed discussion of
individual components.

5.1. OVERVIEW

A netbot method or system of this invention comprises software
and hardware facilities that function together on one or more
network attached computers to assist a user in access to
information present in network attached servers, referred to
herein as "information sources". Fig. 1 generally illustrates the
relationships of a netbot to a user and to networked information
sources. For example, user 1 accesses user computer 3 through
standard interface devices [illegible]. In the course of work,
the user needs information from information sources 7, attached
to the user computer through various network links, such as
network links 4 and 6. Since the information sources are many,
the user can benefit from assistance in finding needed
information from relevant information sources. This assistance is
provided by netbot 5, which maintains an awareness of the
information sources and queries them through links 6 on behalf
of, or at the request of, the user. Alternatively, netbot 5 can
partly or wholly reside on user computer 3 or be partially or
wholly distributed on the network and accessed by the user
through link 4.

WO 98/12881 PCT/US97/17132

Groups of sources 7 having similar sorts of information are
grouped into conceptual classes called information domains. For
example, one domain can be that of electronic stores for a
particular product; another domain might include Internet indexes
containing information on the keyword content of various World
Wide Web ("WWW") pages.

In a preferred embodiment, the netbot is composed of three major
functional modules: a user interface, an integrator, and an I/O
manager. Briefly, the user interface module interacts with the
user to receive user queries for information, and to format and
present information responses received from the network attached
information sources. Advantageously, the user interface is
adapted to the specific information domain being accessed. The
integrator module accepts a user query from the user interface
module, selects relevant information sources, formats the query
for network transmission to each relevant information source,
receives responses from these sources, understands these
responses, and passes the relevant portions of the responses back
to the user interface module for display to the user. The I/O
manager module performs hardware, operating system, and network
specific interfacing for the user interface and integrator
modules so that porting a netbot to different hardware platforms,
operating systems, or networks requires changes only in
well-modularized code.

In particular alternative implementations, either or both of the
user interface or the I/O manager modules may be absent. For
example, the functions of these modules might be already
performed by other operating system components. A netbot can
provide only one or more of the facilities disclosed without
providing others. For example, in some embodiments, a netbot is
useful that performs query routing, alone or in combination with
relevance ranking. In other embodiments, a netbot can simply
format queries and understand responses. Further, additional
modules may be present in or accessed by a netbot. For example, a
learning module can be present which provides for a netbot to
acquire automatically the characteristics of a new network
information source. Finally, as is known to those of skill in the
art, the functions performed by the described modules may be
divided or grouped in alternative fashions among a greater or
lesser number of modules.

In the absence of specific preferences, the processes of this
invention can be implemented in a complete procedural programming
language, such as C, or a complete object oriented programming
language, such as C++, on the disclosed hardware configurations.

In more detail, the user interface module has both important
functionality that is common to netbot user interfaces, whatever
the information domain to which they are directed, and also has
adaptations to the particular information domain of a particular
netbot. Turning first to the preferable common functions, one
such is the ability to remember a user's preferences for
interacting with a netbot. Such remembered preferences include,
for example, whether a netbot is to fetch pages in order to
calculate their relevance itself, the number N of relevant
information sources to query, display characteristics, etc. A
second such common function is incremental display of information
received from a query. Since each query usually causes a netbot
to consult many independent information sources, results are
often received at widely varying times. Immediate display of
asynchronously received results would cause such undesirable
effects as screen flicker or disorienting rearrangement of
already displayed data. Incremental display accommodates, on the
one hand, user desires to view information quickly with, on the
other hand, user desires for a comprehensive view of all received
information, sorted according to relevance.

The incremental display strategy preferably provides one or more
windows on the user's screen with several defined views of the
query satisfaction process along with certain common user
controls, such as screen buttons, for manipulating these windows.
In one window, the user interface module presents lists of the
information sources being consulted with each source symbolically
represented as, for example, a network address, an icon, or
another compact screen representation. Then, as an answer is
received from a particular source, its associated screen
representation changes appearance, e.g., by changing intensity,
changing color, etc. Also displayed is a count of the total
number of unique information items received currently.
Optionally, clicking on the screen representation of an
information source opens a further window with either information
about that information source, or a display of the responses
received from it, or access to the information source over the
network, etc.

One of the common controls is preferably implemented as a common
show-me-button that when activated causes display of another
window in which a list of all currently received responses is
presented ranked according to their estimated relevance to the
query. Another common control is preferably implemented as a
more-button in this latter window that, when activated, causes
re-display of all prior data items merged with items newly
received since the last window display. The newly received items
are merged into the display list in order of relevance, and
distinguished from prior items by, for example, their color or
intensity so that the user can avoid scanning prior items again.
Optionally, clicking on a data item opens a still further
informational window giving either the source of the item, a
display of the response containing the item, or access over the
network to the source of the item, etc.
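The more-button behavior can be sketched as a small buffer that
holds items received since the last display and merges them into
the ranked list on request. This is a minimal illustration in
Python under stated assumptions: the names Item and
IncrementalDisplay are hypothetical, and treating two items with
the same URL as duplicates is an assumption of the example rather
than a rule stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    url: str
    relevance: float


class IncrementalDisplay:
    """Buffers newly received items until the user asks to see them."""

    def __init__(self):
        self._shown = []         # items already displayed, ranked
        self._pending = []       # items received since the last display
        self._seen_urls = set()  # assumption: a URL identifies a unique item

    def receive(self, item):
        """Record an item from a source without disturbing the screen."""
        if item.url in self._seen_urls:
            return  # duplicate of an item already received
        self._seen_urls.add(item.url)
        self._pending.append(item)

    def unique_count(self):
        """Count of unique items received so far."""
        return len(self._seen_urls)

    def more(self):
        """Merge pending items into the list; return (item, is_new) pairs."""
        new_urls = {item.url for item in self._pending}
        merged = self._shown + self._pending
        self._shown = sorted(merged, key=lambda item: -item.relevance)
        self._pending = []
        return [(item, item.url in new_urls) for item in self._shown]
```

The is_new flag is what a user interface would use to render
newly merged items in a different color or intensity.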

In addition to such common functions and controls, a netbot user
interface module preferably implements specific designs,
formatting, and fields suitable to the information domain for
which it is designed. For example, a netbot for comparison
shopping in an information domain of stores on the Internet can
have a particular interface presentation containing labeled
fields for product name [illegible] and price. On the other hand,
a netbot for access to an information domain of online Internet
indexes can have an interface with fields labeled with elements
derived from the [illegible].

In one netbot embodiment, most functions and modules reside on
network attached servers which a user accesses remotely. For
example, the user may access a netbot over the Internet with the
World Wide Web protocols utilizing a web browser, such as
Netscape. In this case the user interface builds HTML formatted
pages which are [illegible] the I/O manager. Fig. 2A generally
illustrates [illegible] an example of such an embodiment which is
further directed to the information domain of online electronic
software stores. [illegible] divided into three sections. Section
11 is [illegible] generally indicating [illegible] netbot for
shopping [illegible] currently [illegible] represented
[illegible] WWW addresses 15, which are selectable to provide
further information [illegible] formatted in accordance with this
particular information domain into sections for the major PC
operating [illegible] display is controlled with the window
[illegible] control facilities built into the web browser. This
user interface is implemented as an HTML formatted page created
at a netbot server and transmitted to the web browser.

Fig. 2B generally illustrates the user display from another
example of a web-browser-based user interface embodiment, which
in this case is directed to an information domain of WWW indexes
or search engines. This display is also generally divided into
three sections. Section 71 displays a title for the netbot;
section 72 displays the status of the current search; and section
73 displays the search results. In more detail, the display of
section 71 includes a logo for this netbot, "MC" standing for
"MetaCrawler," a name chosen since WWW search engines are also
known as "web crawlers," and controls to access certain system
level presentation features, such as the MetaCrawler home page
and user feedback pages. The display of section 72 includes list
74 of the search engines being queried identified by their common
names, the status of the current query in general and at each
search engine, and common user controls. Generally, pie-chart
icon 78 summarizes that 7 of the 8 search engines queried have
already responded to the query. At search engine 75, known as
"Lycos," the check mark indicates that a response containing
information items has already been received. At search engine 76,
known as "Inktomi," the cross mark indicates that a response
without any information items has already been received. On the
other hand, search engine 77, known as "Galaxy," is visibly
distinguished from the other search engines to indicate that it
has not yet responded to the query. The common controls of
section 72 include more-button 79 to request the display of newly
arrived search results, and modify-search-button 80 to request
that a new or modified query be sent. Lastly, the display of
section 73 includes the information items returned from the
search engines. Each information item is displayed separately and
includes title 81, descriptive text 82 if available, and line 83
with the URL of the web page for this item and the estimated
relevance of this item to the query, here "1000." The items are
sorted for display by descending values of the estimated
relevance. The displayed items are scrolled using controls
provided by the web browser. This user interface is implemented
as a Java applet downloaded from a netbot server and executed by
the web browser. In this manner, the interface of Fig. 2B is
capable of greater interactivity than that of Fig. 2A. For
example, it can poll the netbot server for current search status
and update the status displays accordingly without user action.

Although the user interface is described primarily in terms of
windows and buttons, one of skill in the art will recognize that
this invention is adaptable to other display paradigms that
provide for display of information and input of user commands.
For example, the user interface module can control the entire
screen and present graphical displays without intervention of a
windowing system.

The user interface module is preferentially implemented in an
object oriented programming language supplemented with a class
library providing windowing functions. A preferable
implementation uses the Java language together with the java.awt
package. See, for example, Flanagan, 1996, Java in a Nutshell,
O'Reilly & Associates, sections 5 and 19.

THE INTEGRATOR MODULE

Fig. 3 illustrates the preferred functional modules, data bases,
and their functional interrelationship of netbot 30 in general
and of an integrator module 37 in particular. The integrator
preferably consists of three functions: a query router 39, a
wrapper database 40, and an aggregation engine 38. These
components are introduced here and described in detail in the
following. Given user query 31 delivered from user interface
module 34, the integrator first calls query router 39 to rank the
network information sources known to a netbot in order of
relevance and to return the N most relevant sources. Next the
integrator retrieves the N wrappers for the N most relevant
sources from wrapper database 40. These wrappers, which are
descriptions of the information source and its requirements, are
written in the wrapper description language of this invention.
The retrieved wrappers are used by aggregation engine 38, first,
to format the query into forms recognized by each information
source, and second, to understand the information returned by
each information source 33 in order to eliminate extraneous
formatting matter and to put the received information into a
common format. This formatted information is then aggregated and
passed to the user interface module for presentation 32 to the
user according to preferred incremental display method 36. The
user display is also controlled by stored user preferences 35.

THE QUERY ROUTER

A well-behaved netbot should preferably use scarce network
bandwidth and information source processing resources in an
efficient and frugal manner, while at the same time best
answering each user query. Such behavior minimizes resource
usage, and thus achieves best overall performance, both for the
individual netbot and for all netbots functioning simultaneously
on a network.

The query router is important to achieving this behavior because
it permits the netbot to send requests only to information
sources likely to have information relevant to a query. From a
user query, the query router determines the relevance of each
information source to the given query and returns the N most
relevant sources. N is a parameter controllable at user
preference and can be as small as 1. This relevance determination
is preferably over inclusive rather than under inclusive.
Occasionally including an irrelevant information source is
preferable to missing a relevant and important source. Further,
it is preferable that this relevance determination be quick to
compute, not requiring costly processing techniques.

In a preferred embodiment, the query router calculates a
numerical relevance rank value for each information source that
estimates the source's relevance. This calculation is based on
the concept of "conceptual classes." Thus each information source
is tagged in advance with the conceptual classes for which it is
relevant. Then the query router maps each query to the conceptual
classes relevant to it, and finds information sources with
conceptual classes shared by the query. The mapping of a query to
its conceptual classes is preferably done with a hash function.
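The routing step can be sketched as follows. This is a minimal
illustration in Python and rests on several assumptions that the
text does not fix: the hash-based mapping is modeled as a
hash-table lookup from query words to classes, the rank value is
taken to be the number of classes a source shares with the query,
and the names route_query, term_classes, and source_classes are
hypothetical.

```python
def route_query(query, term_classes, source_classes, n):
    """Return up to `n` source names ranked by shared conceptual classes.

    term_classes:   dict mapping a query word to a set of conceptual classes
                    (a hash table standing in for the hash function).
    source_classes: dict mapping a source name to the set of classes it was
                    tagged with in advance.
    """
    # Map the query to its conceptual classes, one hashed lookup per word.
    query_classes = set()
    for word in query.lower().split():
        query_classes |= term_classes.get(word, set())

    # Assumed rank value: the number of classes shared with the query.
    ranked = []
    for source, classes in source_classes.items():
        score = len(query_classes & classes)
        if score > 0:
            ranked.append((score, source))

    # Highest score first; ties broken by name for a stable order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [source for _, source in ranked[:n]]
```

Any source sharing at least one class with the query is a
candidate, which keeps the determination over inclusive, and the
whole computation is a handful of set operations, which keeps it
quick.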

THE AGGREGATION ENGINE

The aggregation engine is the coordinating function of the
integrator module. It receives the user query from the user
interface module and requests the query router to provide a list
of the N information sources most relevant to the given query.
Then it retrieves the N wrappers for the N information sources
from the wrapper database. Guided by the N wrappers, the
aggregation engine translates the query into the request formats
accepted by each of the N information sources and transfers the N
requests to the I/O manager for network transmission. For some
sources, the query may be in the format of a form to be returned.
When a response is received from an information source, the
aggregation engine, again guided by the appropriate wrapper,
extracts data from the response and places it into a list of data
fields, called a tuple format, relevant to the particular
information domain. Optionally, each tuple can be assigned a
priority order using a method appropriate to the particular
information domain. Finally, when the incremental display manager
requests data to present to the user, perhaps in response to a
more-button request, the aggregation engine passes the tuples to
the user interface module, sorted in priority order if a priority
is determined.
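The extraction of tuples from differently formatted responses can
be sketched as follows. This is a minimal illustration in Python
in which an ordinary regular expression with named groups stands
in for a wrapper's response description; it is not the wrapper
description language of this invention. The function names, the
sample response formats, and the field names product and price
are assumptions made for the example.

```python
import re


def extract_tuples(response, pattern):
    """Return one dict of named fields per match of `pattern` in `response`."""
    return [match.groupdict() for match in re.finditer(pattern, response)]


def aggregate(responses, patterns, priority=None):
    """Put every source's response into a common tuple format.

    responses: dict mapping a source name to its raw response text.
    patterns:  dict mapping a source name to the regex describing its format.
    priority:  optional key function giving the display order of the tuples.
    """
    tuples = []
    for source, text in responses.items():
        for fields in extract_tuples(text, patterns[source]):
            fields["source"] = source  # remember where each tuple came from
            tuples.append(fields)
    if priority is not None:
        tuples.sort(key=priority)
    return tuples
```

With one pattern per source, two stores that lay out the same
product and price data differently yield tuples with identical
field names, which can then be ordered, for example, by price
using priority=lambda t: float(t["price"]).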

For example, if the information domain relates to Internet online
software vendors, then the tuples optionally contain such
relevant fields as product name, manufacturer, software version
number, operating system required, price, etc. An exemplary
priority order of the tuples can be by price, by delivery delay,
or other factor at user preference.

For a further example, if the information domain relates to World
Wide Web ("WWW") search engines, which index information pages
available generally on the WWW, then the tuples optionally
contain such relevant fields as title of the indexed page, the
universal resource locator ("URL") of that page, relevance scores
estimating the relation of the indexed page to the query,
descriptive text, etc. An exemplary priority order can be based
on the netbot's normalized estimate of the relevance of the
indexed page to the query. If the netbot does not retrieve the
indexed page, then it sums the normalized relevance estimates for
this response that are returned from each of the search engines.
If a search engine does not return a relevance estimate, a
default value is used. The obtained relevance estimates are
normalized by linearly adjusting the returned scores to have a
common maximum of, e.g., 1000, and then multiplying the adjusted
scores by a confidence factor. This confidence factor, ranging
from 0 to 1, is a predetermined estimate of the reliability of a
particular information source's own relevance estimates. For
example, it can be determined by practical experience with the
information source's relevance estimates. Alternately, the user
can request the netbot to retrieve the page in order to do its
own relevance estimate. In an exemplary embodiment, for queries
requesting the presence either of all query words or of any query
words, the estimate is determined by scanning the page and
counting the number of query words actually present, and then
scaling the count so that the presence of all words results in
the common maximum relevance value. For queries requesting the
presence of a phrase, the estimate is determined, for example, by
subtracting from the common maximum a normalized sum of the
square of the distance in the page of each word of the phrase
from its successor word in the phrase. Thereby, if the phrase
appears contiguously in the page the relevance is high, whereas
if the words of the phrase are widely separated on the page, the
relevance is low.
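The three relevance estimates above can be sketched as follows.
This is a minimal illustration in Python, and the text leaves
several details open that the sketch fills with labeled
assumptions: each engine's own maximum score (engine_max) is
taken as known, the default for a missing estimate is placed at
500 on the common scale, the phrase distance is measured between
the first occurrences of successive words and counted as zero
when they are adjacent, and the normalization of the squared
distances is a plain multiplier (scale). All function and
parameter names are hypothetical.

```python
COMMON_MAX = 1000.0


def combined_relevance(reports, engines, default=500.0):
    """Sum the normalized estimates reported for one page.

    reports: list of (engine name, raw score or None) pairs.
    engines: dict mapping an engine name to (engine_max, confidence),
             where confidence lies between 0 and 1.
    default: assumed common-scale value for an engine giving no estimate.
    """
    total = 0.0
    for engine, raw in reports:
        engine_max, confidence = engines[engine]
        # Linear adjustment to the common maximum, then confidence weighting.
        adjusted = default if raw is None else raw / engine_max * COMMON_MAX
        total += adjusted * confidence
    return total


def word_presence_relevance(page, query_words):
    """Scale the count of query words present so that all words give the maximum."""
    wanted = {word.lower() for word in query_words}
    if not wanted:
        return 0.0
    present = wanted & set(page.lower().split())
    return COMMON_MAX * len(present) / len(wanted)


def phrase_relevance(page, phrase_words, scale=1.0):
    """Common maximum minus a scaled sum of squared gaps between phrase words."""
    words = page.lower().split()
    positions = []
    for word in phrase_words:
        word = word.lower()
        if word not in words:
            return 0.0  # assumption: a missing phrase word gives no relevance
        positions.append(words.index(word))  # first occurrence only
    # Assumption: adjacent, in-order words contribute a gap of zero.
    penalty = sum((b - a - 1) ** 2 for a, b in zip(positions, positions[1:]))
    return max(0.0, COMMON_MAX - scale * penalty)
```

A contiguous phrase scores the common maximum, and each
intervening word lowers the score quadratically, so widely
separated words drive the estimate toward zero.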

In summary, for the majority of information domains, the priority
order is determined from a relevance computation, as in the WWW
search engine example. However, for certain domains such as
online software vendors, a priority order can be simply
determined from the values of one or more numeric fields of the
response tuples.

THE WRAPPER DATABASE

The preferred manner of describing information sources and their
capabilities, in particular their query formats and response
formats, is with compact, modular, declarative descriptions
called wrappers. Since a netbot can access from several hundred
to many thousands of information sources, the descriptions of the
sources are preferably compact, requiring a minimum of storage.
Further, since new information sources are frequently created and
existing sources frequently change their format, easy maintenance
of source descriptions is important. A modular, declarative
description, instead of a complex procedural description,
facilitates such maintenance. In one embodiment of this
invention, wrappers can be learned by a separate module for
information sources having sufficiently regular formats.

For each information source, in an exemplary embodiment,
each wrapper advantageously includes the following
information:

1.  The Universal Resource Locator ("URL") address of the
    information source;
2.  The conceptual classes of the source;
3.  A description of the mapping from query arguments, e.g.
    words or phrases, to fields of the query or HTML defined
    form used to interrogate the source (including site
    support for any, all, phrase, or proximity queries);
4.  A description of the format of the query response or
    HTML page layout that enables parsing of relevant
    information from other information and extraneous
    formatting matter.
At least items 3 and 4 are advantageously written in the
wrapper description language of this invention.
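
The four items of a wrapper can be pictured as a small record. The
following is a minimal sketch under stated assumptions: the class
names Wrapper and WrapperDatabase, the field names, and the choice of
keying the database by URL are all hypothetical and are not taken
from this disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Wrapper:
    """One wrapper per information source (field names are assumed)."""
    url: str                      # item 1: URL of the source
    concept_classes: List[str]    # item 2: conceptual classes
    query_description: str        # item 3: WDL text for forming queries
    response_description: str     # item 4: WDL text for parsing responses
    # Query types the site supports, as mentioned in item 3.
    query_types: List[str] = field(default_factory=lambda: ["any", "all"])


class WrapperDatabase:
    """A toy in-memory wrapper store keyed by source URL."""

    def __init__(self) -> None:
        self._by_url: Dict[str, Wrapper] = {}

    def add(self, wrapper: Wrapper) -> None:
        self._by_url[wrapper.url] = wrapper

    def get(self, url: str) -> Optional[Wrapper]:
        return self._by_url.get(url)
```

Because both descriptions are plain text, such a record stays compact
and can be stored locally or transmitted on demand.
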

A netbot can retrieve wrappers in various manners. In
one embodiment, the information source itself can supply its
own wrapper upon request from a netbot. In an alternative
embodiment, the netbot can provide its own wrappers in
various manners. For example, wrappers can be built in the
netbot itself, especially where the netbot accesses only a
few information sources. Also, wrappers can be stored in a
local database or can be downloaded on demand from a
centralized database.

The wrapper description language (hereinafter referred
to as the "WDL") of this invention facilitates the semantic
description of queries, forms, and pages by using a
declarative description format that combines features from
grammars and regular expressions. Here an example of this
description language is presented. A detailed description is
set forth in a later section. Syntax used follows
conventions known to those of skill in the art for specifying
grammars, including regular expressions. See, e.g.,
Schwartz, 1993, Learning Perl, O'Reilly & Associates, Inc.,
chapter 7; Aho et al., 1986, Compilers: Principles,
Techniques, and Tools, Addison Wesley Publishing Co., section
3.3.

An exemplary description in WDL of a typical page
returned from a WWW search engine follows here. The WDL
interpreter uses the page description to parse a page and to
execute any specified action statements. Note that "stuff"
is a reserved word in the WDL that matches any character
string up to the first occurrence of a mandatory following
string literal.

<page> ::= stuff "<dl>" <item>* .*$

<item> ::= stuff "<dt>" stuff "\"" (stuff) "\"><strong>"
           (stuff) "</strong></a>" stuff "<dd>" (stuff)
           "<br>" { output ($0, $1, $2, 500) }

This describes a page made up of, among other data, a
sequence of zero or more items. In detail, a page is
specified to consist of stuff, then the string "<dl>", then
zero or more items, then zero or more characters (".*"), then
finally the end of the page ("$"). In general, an item
includes three fields of relevance, denoted by (stuff) and
referred to sequentially by $0, $1, and $2, that when an item
is recognized are output by the "output()" statement. In
detail, an item is specified to consist of stuff, then the
string "<dt>", then more stuff, then the string "\"" (a
single double-quote character), then the
first field of interest later referred to by $0, then the
string "\"><strong>", then the second field of interest later
referred to by $1, then the string "</strong></a>", then more
stuff, then the string "<dd>", then the third field of
interest later referred to by $2, then the string "<br>".
The action statement within braces, in this case specifying
output of the variables $0, $1, and $2 with the bindings when
an <item> is matched, is executed when each item is
recognized.
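
The behavior of "stuff" and of the three captured fields can be
approximated with an ordinary regular expression library. The sketch
below is an assumption-laden illustration, not the WDL interpreter:
it maps "stuff" to a non-greedy ".*?" and "(stuff)" to a non-greedy
capture group, which relies on backtracking rather than on the
compiled automata preferred in this disclosure. The names
compile_pattern, ITEM and parse_items are hypothetical.

```python
import re


def compile_pattern(tokens):
    """Build a regex from ('stuff',), ('field',) and ('lit', text) tokens."""
    parts = []
    for tok in tokens:
        if tok[0] == "stuff":
            parts.append(r".*?")      # skip text up to the next literal
        elif tok[0] == "field":
            parts.append(r"(.*?)")    # captured field, i.e. (stuff)
        else:
            parts.append(re.escape(tok[1]))
    return re.compile("".join(parts), re.DOTALL)


# The <item> description above, token by token.
ITEM = compile_pattern([
    ("stuff",), ("lit", "<dt>"), ("stuff",), ("lit", '"'),
    ("field",), ("lit", '"><strong>'),
    ("field",), ("lit", "</strong></a>"),
    ("stuff",), ("lit", "<dd>"), ("field",), ("lit", "<br>"),
])


def parse_items(page, score=500):
    """Return one ($0, $1, $2, score) tuple per item found after <dl>."""
    start = page.find("<dl>")
    if start < 0:
        return []
    body = page[start + len("<dl>"):]
    return [(m.group(1), m.group(2), m.group(3), score)
            for m in ITEM.finditer(body)]
```

Each match plays the role of one recognized <item>, and building the
tuple corresponds to executing the output action for that item.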

THE NETBOT SYSTEM

The preferred functional structure of a netbot can be
assigned to system hardware components in various
alternatives. The preferred alternative in any case depends
on which allocation of function achieves a rapid response and
reasonable cost. Fig. 4 generally illustrates exemplary
netbot hardware embodiments and options in view of the
previous general description. It illustrates the
interrelationship of user computer elements 51-56, network
57, information sources 58, and netbot server computers
59-61. Computer 51 is a user computer including a processor,
memory, and various attached peripherals. Such peripherals
include display device 52, or other device for user
interaction, network attachment 54, optional hard disk
storage 53, and so forth. Computer 51 can be alternatively a
network device without permanent storage, a PC, a
workstation, or more powerful computer. It is preferred that
computer 51 be a PC or a workstation running one of the
Windows operating systems, the Macintosh operating system, or
UNIX. Present in the memory of user computer 51 is, among
other software, local netbot software 55 and local system
components 56. The local netbot software implements one or
more of the netbot functions. The local system components
can include, for example, a web browser.

Network 57 can be any network with a plurality of
attached information sources 58, which can optionally be
conceptually classified by subject matter into information
domains. In a preferred embodiment, network 57 is the public
Internet or a private intranet supporting the TCP/IP suite of
protocols, including such user level protocols as FTP, HTTP,
and so forth. The information sources are server computers
which make their stored information available using the
protocols supported by network 57. Such information can
include databases of periodicals, newspapers, etc.,
information on or produced by particular commercial,
educational, or other types of organization, facilities for
electronic commerce, etc.

In such a network, a netbot can have various
embodiments. In an entirely local embodiment, all netbot
functions reside in local netbot software 55 on user computer
51, which in this embodiment must have sufficient processing
and storage capabilities. In alternative embodiments, one or
more of the disclosed netbot functions can be distributed on
other network attached computers.

For example, computer 59 is a wrapper server for
accepting requests for downloading wrappers from its wrapper
database. The wrapper database can be stored in memory or on
disk using any data management system capable of storing and
retrieving compact textual descriptions. Computer 60 is a
query server for performing query routing by accepting
queries and returning the N most relevant information sources
from the many tens or hundreds of thousands about which it
stores information. Computer 61 is a netbot server for
performing the integrator module function by accepting user
queries and returning search results, perhaps using the
facilities of wrapper server 59 and query server 60. With
these network servers, local netbot software preferably only
supports the user interface, which may be delegated entirely
to a web browser. Alternately, it can further include the
aggregation engine, which makes query routing requests to
query server 60 and wrapper requests to wrapper server 59.
Further, it can include one or both of these latter
functions.

The various computers of a netbot system can be provided
with software for performing the methods of this invention
either from computer readable media or by loading across a
network. This invention is adaptable to known magnetic and
optical media, such as disks, tapes and CD-ROM.

5.2. THE I/O MANAGER
The I/O manager module performs hardware, operating
system, and network specific interfacing for the integrator
module. Network interfacing includes the tasks of sending
requests and receiving responses from network attached
information sources. An important application of the netbots
of this invention is to information retrieval over the
Internet. In this application the I/O manager is responsible
for implementing the relevant protocols of the WWW, Gopher,
FTP, Internet tools, etc. Optionally, it can temporarily
cache pages and other data in order to improve response time.

Operating system interfacing includes the task of window
management for the user interface module and access to the
wrapper database, if present.

Preferably, the I/O manager is constructed from
commercially available protocol stacks, windowing libraries,
such as the java.awt package, and other tools. In some
implementations, more or less of the I/O manager functions
can be performed by other system components on the network
attached computer. Optionally, the I/O manager is designed
to be scalable to multiple machines, to not require
multi-threaded or reentrant code, and to be cross platform and
persistent.



5.3. THE AGGREGATION ENGINE
In the preferred embodiment the functions previously
identified for the aggregation engine component of this
invention are performed by the following process. Searches
of the information sources proceed in parallel because all
requests are transmitted without waiting for any responses.

1.  Receive a user query from the user interface module;
2.  Perform query routing to determine the N information
    sources most relevant to the user query;
3.  For each of the N relevant information sources, do:
    A.  Retrieve the wrapper for this information source
        (for example, from the source itself or from a local or
        remote wrapper database);
    B.  Guided by the wrapper, format the user query into
        the form or format required by the information source;
    C.  Transfer the translated command to the I/O module
        for transmission to the information source;
4.  Initialize the list of responses to be empty;
5.  Until a user specified time limit is reached, do:
    A.  When an information source response has been
        received by the I/O module and transferred to the
        integrator, then:
        i.  Guided by the wrapper for this information
            source, parse the response to understand the
            information returned, discard the site-specific
            formatting text and other irrelevant matter, and
            gather relevant fields into tuples;
        ii. Add each tuple to the list of tuples,
            optionally performing priority ranking, duplicate
            elimination, etc.;
    B.  Wait for the next response;
6.  Deliver the list of created tuples to the user interface
    module on request, which can be due, for example, to
    user activation of the show-me-button or more-button
    controls.
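
The numbered process above can be sketched with a thread pool, so
that all requests are issued before any response is awaited. This is
a sketch under stated assumptions, not the preferred implementation:
the callables route, get_wrapper and send stand in for the query
router, the wrapper database and the I/O module, and the wrapper
methods format_query and parse are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, TimeoutError


def aggregate(query, route, get_wrapper, send, n=3, time_limit=2.0):
    """Query the N most relevant sources in parallel and collect tuples."""
    source_ids = route(query, n)                       # step 2
    results, seen = [], set()                          # step 4
    pool = ThreadPoolExecutor(max_workers=max(1, len(source_ids)))
    try:
        pending = {}
        for sid in source_ids:                         # step 3
            wrapper = get_wrapper(sid)                 # 3A
            request = wrapper.format_query(query)      # 3B
            pending[pool.submit(send, request)] = wrapper  # 3C
        try:
            # Step 5: handle responses as they arrive, up to the limit.
            for future in as_completed(pending, timeout=time_limit):
                try:
                    response = future.result()
                except Exception:
                    continue  # assumption: a failed source is skipped
                for tup in pending[future].parse(response):   # 5A.i
                    if tup not in seen:                       # 5A.ii
                        seen.add(tup)
                        results.append(tup)
        except TimeoutError:
            pass  # time limit reached; return what has arrived so far
    finally:
        pool.shutdown(wait=False)
    return results                                     # step 6
```

Only duplicate elimination is shown in step 5A.ii; priority ranking
would be applied to the returned list.
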

When multiple information sources are queried, it is
preferable in step 6 to present to the user interface module,
and thereby to the user, a single merged list of tuples
extracted from the responses and sorted according to an
estimate of the significance or relevance of each tuple to
the user. Such an estimate is preferably made according to
methods specific to the information domain to which the
netbot is directed. For certain domains, a significance
estimate can be made directly from the value of one or more
data fields in the tuples. For example, in a domain of
electronic shopping, significance can be related only to
price or delivery date, according to user preference. For
most domains, however, a significance estimate is made
according to relevance of the returned information, which
must be determined by examining the responses from each
source.

In a preferred embodiment of such a relevance
determination, the user has the choice of whether or not to
have the netbot examine all information pages itself. If the
user so chooses, the relevance is determined by the netbot
according to a domain specific Analyze function. In a domain
of information queries, an exemplary Analyze function finds
the number and location of query words in the returned
response. For keyword queries, responses with more keywords
present with greater frequency are more relevant. For phrase
based queries, responses having the words of the phrase more
closely spaced, for example in one sentence or even in
contiguous sequence, are more relevant. In other domains,
appropriate Analyze functions are provided.

If the user chooses not to have the netbot examine the
responses, the netbot relies on relevance estimates returned
from the information sources. If a particular source does
not return a relevance estimate, a default value is used.
These estimates are then normalized to be between, e.g., 0
and 1000, and multiplied by a confidence ranking factor.
This confidence factor, ranging from 0 to 1, is a
predetermined estimate of the reliability of a particular
information source's own relevance estimates. For example,
it can be determined by practical experience with the
source's estimates. Where the same tuple is returned from
two or more sources, the relevance values from all those
sources are combined. Optionally, the relevance estimates
returned from each source are adjusted to have a uniform
distribution in the normalization range.

In a particular detailed embodiment, this determination
is performed according to the following process. Here, query
routing has determined a list of K information sources,
source_k with k from 1 to K, and returned their confidence
ranks, crank_k. Each of these sources has been queried and
has returned responses, and K lists of information tuples,
tuples_k, each indexed by j from 1 to length_k, have been
extracted from these responses. The user's preference for
netbot analysis is recorded in verification flag V. The
variable t.score represents the composite relevance score for
tuple t; the variable t.sourcescore_k represents the relevance
estimate returned from information source_k for the response
that tuple t was extracted from.

Input:  List of K sources with their confidence rank pairs
        (source_k, crank_k), obtained from the query
        routing system; K ordered lists of tuples,
        tuples_k, of length length_k, obtained from source
        source_k; and the verification flag V (Boolean),
        obtained from user preferences
Output: Merged list of all tuples sorted by relevance.

/* Main routine */
IF (V is true) THEN
    FOREACH k = 1..K
        FOREACH tuple t in tuples_k
            page = the HTML page that tuple t points to,
                   downloaded if necessary
            t.score = Analyze(page);
ELSE /* V is false */
    FOREACH k = 1..K
        NormalizeScores(tuples_k)
        AdjustByHeight(tuples_k)
        AdjustByServiceRanking(tuples_k)
    /* Merge result tuples, t, into MERGED_LIST and
       determine a composite relevance score, t.score, from the
       scores returned by the information sources,
       t.sourcescore_k; the same tuple returned from multiple
       sources has its composite score incremented by the
       source score from each source */
    FOREACH k = 1..K
        FOREACH tuple t in tuples_k
            IF t is not in the MERGED_LIST THEN
                Add t to the MERGED_LIST
                t.score = t.sourcescore_k
            ELSE
                t.score = t.score + t.sourcescore_k
            ENDIF
ENDIF
SORT all tuples t by t.score and discard duplicates
OUTPUT sorted tuples
EXIT /* finished relevance ranking */

/* Subroutines */

SUBROUTINE NormalizeScores(tuples_k)
/* If information source_k returns relevance estimates,
   normalize them to fall in the range from 0 to 1000;
   otherwise, use a default relevance estimate */
/* "s" is the k'th information source's relevance score
   for the first tuple on the list of tuples from source_k */
    s = tuples_k[1].sourcescore_k
    IF (s == 0) THEN
        /* this information source returns no scores,
           therefore use default */
        FOREACH tuple t in tuples_k
            t.sourcescore_k = 1000
    ELSE
        scaling_factor = 1000.0 / s;
        FOREACH tuple t in tuples_k
            t.sourcescore_k = t.sourcescore_k *
                              scaling_factor;
    ENDIF
ENDSUB

SUBROUTINE AdjustByHeight(tuples_k)
/* Adjust the source scores to have a uniform percentile
   distribution; for example for 10 tuples, the first tuple's
   source score is adjusted to 100% of its source score, the
   second tuple's source score is adjusted to 90% of its source
   score, etc. */
    percent_step = 100 / length_k;
    percent_off = 0;
    FOREACH tuple t in tuples_k
        t.sourcescore_k = t.sourcescore_k * (100 -
                          percent_off) / 100;
        percent_off = percent_off + percent_step;
ENDSUB

SUBROUTINE AdjustByServiceRanking(tuples_k)
/* Each Service is assigned a percentile ranking indicating
   the confidence given to its returned source scores; this
   ranking is used to scale the returned source scores for each
   tuple; for example, a 90% confidence ranking means that each
   source score is scaled by 0.9 */
    FOREACH tuple t in tuples_k
        t.sourcescore_k = t.sourcescore_k * crank_k
ENDSUB
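
The branch of this process taken when V is false can be written out
as runnable code. The following is an illustrative sketch only: the
class name Hit, its fields, and the use of a string key to decide
whether two sources returned the same tuple are assumptions, and the
Analyze branch (V true) is omitted.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    key: str              # assumed identity of a tuple, e.g. its URL
    source_score: float   # 0 means the source returned no score
    score: float = 0.0    # composite relevance score


def normalize_scores(hits, top=1000.0):
    """Scale scores so the first (best) hit scores `top`."""
    if not hits:
        return
    s = hits[0].source_score
    for h in hits:
        h.source_score = top if s == 0 else h.source_score * top / s


def adjust_by_height(hits):
    """Scale the i-th of n hits by (n - i) / n: 100%, 90%, ... for n = 10."""
    n = len(hits)
    for i, h in enumerate(hits):
        h.source_score *= (n - i) / n


def adjust_by_service_ranking(hits, crank):
    """Scale every score by the source's confidence rank (0 to 1)."""
    for h in hits:
        h.source_score *= crank


def merge(per_source, cranks):
    """Return all hits merged across sources, best composite score first."""
    merged = {}
    for hits, crank in zip(per_source, cranks):
        normalize_scores(hits)
        adjust_by_height(hits)
        adjust_by_service_ranking(hits, crank)
        for h in hits:
            if h.key not in merged:
                h.score = h.source_score
                merged[h.key] = h
            else:
                merged[h.key].score += h.source_score
    return sorted(merged.values(), key=lambda h: h.score, reverse=True)
```

A tuple returned by several sources accumulates the adjusted score
from each, so agreement between sources raises its rank.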

One of skill in the art will recognize that these
processes are amenable to routine alterations and
enhancements that perform the same functions in the same
manners. In particular other values for the normalization
range and default value for the information source relevance
estimates can be used. This invention includes such routine
alterations. These processes are preferably implemented in
C++, but can alternatively be implemented in any procedural
or object-oriented programming language.

5.4. THE QUERY ROUTER
The query router receives as input a user query
expressed as a list of words or keywords and returns as
output a list of N information sources ordered by their
likely relevance to the input query. Determination of these
information sources is optimized for speed and
over-inclusiveness. Occasionally including an irrelevant
information source is preferable to omitting a relevant
source.

The preferred query router is based on the principle of
assigning relevant concepts to information sources and query
words. In advance, a set of concepts is chosen to describe
the information sources of the one or more information
domains to which one or more netbots are directed. For each
information source in the domains, the relevance of that
source to each of the chosen concepts is judged. Further,
each word that can appear in possible queries is examined to
determine which of the chosen concepts are relevant to the
word. Then, upon receiving the words or keywords of a query,
the concepts associated with these words are determined, and
then the information sources relevant to these concepts are
found. The ranked relevance of each source is determined by
combining the individual relevances of the source to all the
concepts of the query. The case of phrase based queries is
preferably handled by generating separate data for this query
type.

The preferred implementation of this process utilizes
four tables containing relevance information. In the
following, W is chosen to be somewhat bigger, e.g. 10%, than
the number of words that can appear in possible queries; C is
the number of chosen concepts; and S is the number of
information sources in the information domain.
WORD2CONCEPT[] is a table of W vectors of C bits, where the C
bits of the vector for a word indicate which of the C
concepts are relevant to that word. CONCEPT2SOURCE[][] is a
C by S table. For each of the C concepts and S sources, the
corresponding entry of this table contains the relevance
value of that source to that concept. For example, if entry
<i,j> equals 5, the j-th information source has a relevance
weight of 5 with respect to the i-th concept.
CONCEPT2SOURCE[][] is used when searching by words. For
searching by phrases, the table CONCEPT2PHRASE[][] similarly
relates concepts to sources. Finally, DEFAULT_RELEVANCE[]
has a default relevance weight for each of the S information
sources.

The preferred implementation performs the following
process.

1.  For each of the S information sources, set RELEVANCE[j]
    = DEFAULT_RELEVANCE[j];
2.  For each word in the user query, do:
    A.  Compute a hash function on the word obtaining a
        number, M, between zero and W. Any suitable hash
        function is adaptable to this process. An exemplary
        hash function is found in Sedgewick, 1990, Algorithms in
        C, Addison-Wesley Publishing Co., chapter 16.
    B.  Let the C bit vector V equal WORD2CONCEPT[M];
    C.  Combine the relevances for all the concepts in V to
        the relevance for the information sources by performing:
            For i from 0 to C, do:
                If (i-th bit of V is '1') THEN
                    FOR j = 1 to S DO
                        RELEVANCE[j] = RELEVANCE[j] +
                                       CONCEPT2SOURCE[i,j]
        Monotonically increasing functions other than "+" can
        also be used to combine the individual concept
        relevances into a final relevance;
    D.  Combine the relevances for all the words in the
        user query, for example by adding them together;
3.  In the case of searching by phrases, additionally do:
    A.  Concatenate all the words of the user query phrase;
    B.  Compute the hash function on the phrase,
        obtaining M, and set the C bit vector V equal to
        WORD2CONCEPT[M];
    C.  Combine the relevances for all the concepts in V to
        the relevance for the information sources by performing:
            For i from 0 to C, do:
                If (i-th bit of V is '1') THEN
                    FOR j = 1 to S DO
                        RELEVANCE[j] = CONCEPT2PHRASE[i,j]
4.  Sort information sources based on their RELEVANCE, and
    return the N most relevant sources.
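
The word-based part of this process can be sketched as follows. This
is a hedged illustration rather than the preferred implementation:
each C bit vector is represented as a Python integer, zlib.crc32 is
used as a stand-in for "any suitable hash function", and the names
bucket and route are hypothetical. Phrase handling is omitted.

```python
import zlib


def bucket(word, w):
    """Hash a word to a table index in the range 0..w-1 (assumed hash)."""
    return zlib.crc32(word.lower().encode("utf-8")) % w


def route(query_words, word2concept, concept2source, default_relevance,
          n, hash_fn=bucket):
    """Return the indices of the n most relevant information sources."""
    w = len(word2concept)
    relevance = list(default_relevance)               # step 1
    for word in query_words:                          # step 2
        bits = word2concept[hash_fn(word, w)]         # steps 2A and 2B
        for i, row in enumerate(concept2source):      # step 2C
            if (bits >> i) & 1:
                for j, weight in enumerate(row):
                    relevance[j] += weight
    ranked = sorted(range(len(relevance)),
                    key=lambda j: relevance[j], reverse=True)
    return ranked[:n]                                 # step 4
```

Because two words can hash to the same entry, a query may pick up
concepts it does not contain, which is consistent with the stated
preference for over-inclusiveness.
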

One of skill in the art will recognize that this process is
amenable to routine alterations and enhancements that
perform the same functions in the same manner. This
invention includes such routine alterations.
This process is preferably implemented in C++, but can
alternatively be implemented in any procedural or
object-oriented programming language. In the case of a query router
which maintains information on a large number, e.g., tens of
thousands, of information sources, the query router is
preferably implemented as a server process on a server
computer to accommodate the size of the required data
structures and the processing requirements of query routing.

An exemplary construction of the table WORD2CONCEPT
begins with the selection of concepts to characterize the
information domain of interest and the determination of words
and phrases likely to occur in user queries. For each
concept, the following actions are performed. The words and
phrases associated with that concept or to which the concept
relates are assigned to the string arrays KEYS[] and
PHRASES[]. Then the following process is carried out.

FOR i equals 1 to the number of elements in KEYS[], DO
    Apply the previously used hash function to KEYS[i]
    to obtain a number, M, between 0 and W
    SET the bit matching the current concept in
    WORD2CONCEPT[M].
FOR j equals 1 to the number of elements in PHRASES[],
DO
    Apply the previously used hash function to PHRASES[j]
    to obtain a number, M, between 0 and W
    SET the bit matching the current concept in
    WORD2CONCEPT[M].

These actions are repeated for each chosen concept.
Alternatively, concept information can be stored along with
the string information, using either open or closed hashing,
in order to preserve accurate string to concept matching.
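
The construction just described amounts to setting one bit per
concept in a hashed table. The sketch below makes several
assumptions: the function name build_word2concept is hypothetical,
bit vectors are Python integers, the hash function is supplied by the
caller, and the words of a phrase are concatenated before hashing so
that construction matches the phrase lookup described earlier.

```python
def build_word2concept(concept_terms, w, hash_fn):
    """Build the W-entry table of concept bit vectors.

    concept_terms[c] is a (keys, phrases) pair for concept number c.
    """
    table = [0] * w
    for concept, (keys, phrases) in enumerate(concept_terms):
        for key in keys:
            table[hash_fn(key) % w] |= 1 << concept
        for phrase in phrases:
            # Assumption: concatenate the phrase words before hashing.
            joined = "".join(phrase.split())
            table[hash_fn(joined) % w] |= 1 << concept
    return table
```

Two strings that hash to the same entry share its bits, which is the
inaccuracy that the alternative open or closed hashing scheme avoids.
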

5.5. THE WRAPPER DESCRIPTION LANGUAGE
This section first presents introductory material on the
use of the WDL in wrappers. Next, the two principal
components of the WDL - the action language and the regular
expression language - are described in detail. Finally, an
exemplary embodiment of the WDL is presented.

A wrapper is a description of an information source and
how a netbot should interact with it, in particular how to
format requests to it and how to understand responses from
it. Netbots need to interact efficiently with hundreds to
many thousands of information sites on a network. This
interaction presents two requirements: first, compact storage
of a description of an information source using a
representation encoded in the WDL; and second, use of this
description to understand the information source. Currently,
for example, netbots need to format requests to a source and
to parse useful information from the pages returned by the
source while ignoring irrelevant formatting information. As
information sources become more functional, netbots will need
to process interactions more complex than simple
request-response pairs. Therefore, wrappers and the WDL preferably
have the following features:

1) WDL descriptions are easy to write, easier than, for
example, in C++. This is important because new information
sources are created frequently and existing services change
their format frequently. Optionally, wrapper descriptions
can be automatically generated using machine learning
techniques in information domains where responses have a
regular, predetermined format.

2) Wrapper descriptions are small. This is important so
they can, for example, be stored locally in a database or
quickly transmitted from a server to a netbot running in a
client, even over a slow network connection. Optionally,
information sources can supply their own wrappers on request.

3) WDL descriptions can be automatically compiled into fast
finite state automata that quickly parse the information
returned by information sources.

4) Using wrappers with WDL permits netbots to adapt to new
types of information formats and new types of information
server interactions in the future.

The wrapper description preferably specifies at least
two processes: first, requesting information from an
information source, e.g., how to fetch the appropriate HTML
formatted page from a source; and second, how to parse
returned pages to extract the relevant data. To perform the
first process, the WDL includes an action language component,
which is an extensible language of expressions and
statements. To perform the second process, the WDL includes
a regular expression language component, which is an
extensive and novel means of specifying regular expression
pattern matching facilities.

In alternative implementations, netbots can utilize
alternative pattern matching facilities known to those of
skill in the art. For example, the regular expression
component could be replaced with a context free language
specification ("CFL"). In this case, implementation of the
WDL can follow techniques known for construction of compiler
compilers, for example YACC. However, where possible,
regular expression pattern matching is preferred because of
its straightforward specification and rapid execution.

An Example Of The Regular Expression Component

Regular expressions are advantageous for describing the
format of the information returned from many information
sources. The regular expression component of the WDL
augments prior regular expression matching facilities in
several novel manners. First, it permits programming
language facilities of the action language, e.g. statements
and expressions, to be executed when regular expressions are
recognized and with variable bindings as determined by
partial matches recognized during the overall recognition.
Second, the preferred implementation compiles regular
expressions into compact and efficient finite state automata.
Third, it encourages the efficient and intuitive expression
of complex regular expressions in a nested manner. And
fourth, it possesses an efficient backtrack-free padding
facility.

An exemplary wrapper follows which is suitable for an
Internet information search engine and is written in the WDL.
The significance of each portion is explained in comments,
which are surrounded by "/*" and "*/".

/* A list of input query words is passed to the wrapper
   from the aggregation engine, which is executing the
   wrapper for the information source. The argv() function
   of the action language extracts a list of the input
   query words. */
{ $keywords = argv(2);

/* The request to be sent to the information source is
   calculated by concatenating, indicated by the "."
   operator of the action language, three separate strings:
   one, the Internet URL address of information source and
   the initial query format string; two, the query words to
   search; and three, the remainder of the query format
   string for this information source. */
$url = "http://searcher.source.com/searcher.cgi?query="
       . $keywords . "ionlyrr-O";

/* The fetch action statement transfers this query to the
   I/O module for network transmission, and then waits for
   the HTML formatted response. */
$page = fetch(0, $url, "");

/* The HTML formatted response text is parsed using the
   following regular expression grammar. */
$result = parse($page, <page>); }

/* A response from this exemplary source consists of a
   page containing zero or more information items. This is
   hierarchically described in the regular expression
   language by expressions for <page> and <item>, where
   <page> refers to <item>s. In particular, a page
   includes a chunk of text followed by the string "results
   returned ...", then followed by zero or more items. */
<page> ::= stuff 'results returned, ranked by'
           <item>* END

/* Each item consists of irrelevant text, HTML
   formatting codes, and relevant data fields. The
   relevant data fields are enclosed in parentheses and
   referred to sequentially by $0, $1, and $2. These
   fields, along with the number 500, are output as a tuple
   when the "output" statement of the action language is
   executed upon recognition of an <item>. The particular
   meaning of the definition of <item> will become apparent
   in the next section. */

<item> ::= stuff '<hr>' stuff '<center><b><a href="'
           (stuff) '"' stuff '>' (stuff) '</a>'
           stuff '<font size=-1>' (stuff) '</font>'
           { output($0, $1, $2, 500); } END
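
The overall flow of this exemplary wrapper (build the request by
concatenation, fetch the page, parse the items) can be mimicked in a
few lines. This is a sketch under stated assumptions and not the WDL
interpreter: build_url, parse_page and run_wrapper are hypothetical
names, the caller supplies the fetch function, the query words are
URL-encoded with quote_plus (an addition not present in the wrapper),
and the trailing part of the query format string is left as an empty
parameter because it is not legible in the source.

```python
import re
from urllib.parse import quote_plus

BASE = "http://searcher.source.com/searcher.cgi?query="
MARKER = "results returned, ranked by"

# Regex rendering of <item>: "stuff" becomes a non-greedy ".*?".
ITEM = re.compile(
    r'.*?<hr>.*?<center><b><a href="(.*?)".*?>(.*?)</a>'
    r'.*?<font size=-1>(.*?)</font>',
    re.DOTALL,
)


def build_url(keywords, suffix=""):
    """Concatenate the base URL, the encoded query words, and a suffix."""
    return BASE + quote_plus(" ".join(keywords)) + suffix


def parse_page(page, score=500):
    """Return one ($0, $1, $2, score) tuple per item after the marker."""
    start = page.find(MARKER)
    if start < 0:
        return []
    body = page[start + len(MARKER):]
    return [(m.group(1), m.group(2), m.group(3), score)
            for m in ITEM.finditer(body)]


def run_wrapper(keywords, fetch):
    """Format the request, fetch the response, and parse it."""
    return parse_page(fetch(build_url(keywords)))
```

A caller would pass a real page-fetching function as fetch; passing a
stub that returns canned HTML is enough to exercise the parsing step.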

5.5.1. DESCRIPTION OF THE WDL COMPONENTS
This section describes the preferred features of the
action language and the regular expression components of the
WDL. This invention is not limited to the descriptions
presented. These descriptions make use of known methods for
describing grammars, including regular expressions, and one
of skill in the art will recognize that there are many
substantially equivalent descriptions. Such equivalent
descriptions include, but are not limited to, those resulting
from renaming the described syntactic elements or from
applying known grammatical transformations to the presented
syntax. This invention comprehends also these substantially
equivalent descriptions.

The Action Language Component
The action language includes certain preferred base-level
features: an assignment statement; sequencing
statements such as "if" and "while" constructs; a tuple
output statement; string and numeric expressions; string,
numeric, and boolean operators; and certain built-in
functions. It will be apparent to those of skill in the art
that a base-level action language of equivalent semantic
expressive ability can be constructed with slightly different
choices of features. For example, the while statement can be
replaced by a goto statement. The action language component
of this invention comprehends such known equivalent
formulations of the semantics disclosed. Further, in
optional embodiments, the preferred base-level features can
be augmented with such additional features as: additional
sequencing statements such as "for" or "repeat" loops; user
defined functions; additional string, numeric, and boolean
expressions and operators such as are found in C or C++;
additional built-in functions; and array variables. Such
additional features can be added to the preferred base-level
language in known manners. See, e.g., Aho et al., 1986,
Compilers: Principles, Techniques, and Tools, Addison Wesley
Publishing Co.

The syntax of the preferred base-level action language
is given by the following grammar expressed in a standard
notation.

Statement ::=
      $variableName "=" Expression ";"
    | "if" Expression Statement [ "else" Statement ]
    | "while" Expression Statement
    | "{" Statement* "}"
    | "output" "(" [ Expression { "," Expression }* ] ")" ";"

Expression ::=
      string-constant
    | float-constant
    | $variableName
    | Expression op Expression
    | "(" Expression ")"
    | functionName "(" [ Expression { "," Expression }* ] ")"


Thus, the action language component is defined in terms
of statements and expressions. A statement can be: an
assignment statement, which assigns a new value to a scalar
variable; an if statement; a while loop; a compound
statement, which is a sequence of other statements enclosed
by "{" and "}"; or an output statement. Except for the
output statement, the statements function in the same manner
as in other procedural programming languages, e.g., C, C++,
or Pascal. For the if and while statements, the conditional
argument is considered true if it is either a non-zero number
or a non-null string, i.e., not the empty string "". The
output statement allows a wrapper to return information
matched in responses from the information source to the
netbot module executing the wrapper. For example, executing
the statement "output(arg_1, ..., arg_n)" causes the tuple
<arg_1, ..., arg_n> to be returned from the wrapper to the
netbot.

An expression can be: a string constant, which is a
symbol string surrounded by quotation marks; a floating point
number; a variable name, which is a name or an integer
preceded by a dollar sign; an infix operator applied to two
surrounding sub-expressions; a sub-expression, which is
enclosed in "(" and ")"; or a call to a built-in function.

Further, the language provides operators standard in
procedural programming languages, including the following:
arithmetic operators ("+", "-", "*", "/"); numeric comparison
operators ("<", "=", ">", "<=", ">=", "!="); string
comparison operators ("lt", "eq", "gt", "le", "ge", "ne"); a
string concatenation operator ("."); and boolean operators
("&&", "||", "!"). These operators have the following
semantics:

1) +, -, *, / : perform the indicated arithmetic operation
   on the surrounding expressions;

2) . : concatenate the surrounding strings;

3) =, <=, >=, <, >, != : perform the indicated numeric
   comparison of the surrounding numbers and return the
   floating point values 0.0 or 1.0 accordingly;

4) eq, le, ge, lt, gt, ne : perform the indicated
   character-by-character comparison of the ASCII codes of
   the surrounding strings and return the floating point
   values 0.0 or 1.0 accordingly;

5) ! : return 1 if argument is the number 0 or the string
   "", otherwise return 0;

6) && : evaluate the first argument; if the first argument
   is zero or "", stop and return that value; else
   evaluate the second argument and return the latter
   value;

7) || : evaluate the first argument; if the first argument is
   neither zero nor "", stop and return that value
   immediately; else evaluate the second argument and
   return the latter value.

These operators have the following precedence in order from
highest to lowest:

1. any expression inside parentheses (highest precedence)
2. *, /
3. +, -, .
4. =, <=, >=, <, >, !=, ne, eq, le, lt, gt, ge
5. ! (boolean not)
6. && (boolean and)
7. || (boolean or) (lowest precedence)

All operators are left-associative.
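
By way of illustration only, the truth test and the boolean operators described above can be sketched in Python as follows. The function names (is_true, logical_not, logical_and, logical_or) are hypothetical and are not part of the action language; passing the second operand as a zero-argument callable is an assumption of this sketch, made so that the operand is evaluated only when the semantics above require it.

```python
def is_true(value):
    """Truth test used by if, while, !, && and ||: a value is true when it
    is a non-zero number or a non-empty string."""
    if isinstance(value, str):
        return value != ""
    return value != 0


def logical_not(value):
    """Operator !: 1 for the number 0 or the empty string, otherwise 0."""
    return 0.0 if is_true(value) else 1.0


def logical_and(first, second):
    """Operator &&: `second` is a zero-argument callable (an assumption of
    this sketch) so that it is evaluated only if `first` is true."""
    if not is_true(first):
        return first          # stop and return the first value
    return second()           # otherwise return the second value


def logical_or(first, second):
    """Operator ||: return `first` if it is true, else evaluate `second`."""
    if is_true(first):
        return first
    return second()
```

Note that, under the stated truth test, the string "0" is true because it is not the empty string; only the number 0 and the string "" are false.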

No variable declarations are needed. First, all
variables in the action language are distinguished by being
preceded by a "$". Those special variables denoting
sub-strings matched by regular expressions consist of a "$"
followed by an integer, e.g., $0 or $1. Second, at run time,
automatic type conversion between string and floating point
types is done dynamically. If an operator expects a number
but gets a string, the string is converted to a number by
calling the C library function atof(), which converts the
ASCII representation of a number into its internal
floating-point representation. If an operator expects a
string argument but gets a number, it uses the C library
function sprintf(..., "%f") to convert it to a string. Third,
referenced variables that have not yet been assigned are set
to the default values 0 or "", as appropriate.
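
The automatic conversions described above can be approximated by the following Python sketch, which is offered as an illustration and not as the implementation. The helper names (to_number, to_string, add, concat) are hypothetical. The regular expression imitates only the common behavior of atof(), namely reading the longest leading decimal number and yielding 0.0 when there is none; other forms that atof() accepts, such as hexadecimal or infinity, are deliberately omitted.

```python
import re

# Longest leading decimal number: optional sign, digits, optional exponent.
_LEADING_NUMBER = re.compile(r"\s*[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def to_number(value):
    """Convert an operand to a float the way an atof()-style call would."""
    if isinstance(value, (int, float)):
        return float(value)
    match = _LEADING_NUMBER.match(value)
    return float(match.group(0)) if match else 0.0


def to_string(value):
    """Convert an operand to a string using the "%f" format."""
    if isinstance(value, str):
        return value
    return "%f" % value


def add(left, right):
    """Arithmetic operators expect numbers, so strings are converted."""
    return to_number(left) + to_number(right)


def concat(left, right):
    """The concatenation operator expects strings, so numbers are converted."""
    return to_string(left) + to_string(right)
```

One consequence worth noting is that "%f" always prints six decimal places, so concatenating a number onto a string yields text such as "n=2.000000".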

The action language component has several built-in
functions. The following are preferred base-level built-in
functions.

1) argc(), argv(): When a netbot executes a wrapper,
it can pass the wrapper one or more arguments. These usually
represent query parameters, query words, or query keywords
supplied by the user. These arguments are accessed from
within the wrapper language by the functions argc(), which
returns the number of arguments passed to the wrapper, and
argv(n), which returns the n-th argument.

2) fetch(): This function preferably interfaces with
the I/O module to transfer a string containing the network
address of an information source and perhaps query parameters
over a network to the addressed information source according
to the proper protocol, and returns a string containing the
response of the information source. Wrappers use this
function to query information sources and retrieve pages.

3) parse(<string>, <nonterminal>): This function takes
a string and attempts to match it to the regular expression
corresponding to the given <nonterminal>, as defined by the
regular expression language component of the WDL. This
function returns "1" if the match was successful, or "0" if
the string did not match the regular expression. Wrappers
use this function to parse the response of an information
source.

An exemplary wrapper includes the following sequence of
action language statements. First is a series of statements
using argc() and argv() to obtain the user query parameters
and to initialize a string variable, e.g., $url, with a
string containing the URL of the appropriate page at the
information source together with an appropriately formatted
query. Next, the assignment statement "$html_text = fetch($url)"
fetches the query-response page into another string
variable, e.g., $html_text. Finally, the function
parse($html_text, <page>) attempts to match the returned html
text with the regular expression, <page>, which describes the
sought page.



The Regular Expression Language Component
The regular expression ("reg-exp") language component of the
WDL matches strings against regular expressions. It has been
found that regular expressions are convenient to describe the
format of responses returned from a wide range of information
sources. However, the reg-exp language of this invention is
capable of matching this information so that relevant fields
can be extracted in a more rapid and more convenient manner
than can prior languages and systems, such as AWK or Perl.
The regular expression component includes novel facilities
that solve the problems with prior matching systems, and
thereby permit its use by netbots.

A first novel facility allows the specification of
regular expressions to be broken into pieces in a manner
similar to a context-free grammar, which specifies a language
by a set of rules for nonterminals in the grammar. See,
e.g., Aho et al., 1986, Compilers: Principles, Techniques,
and Tools, Addison Wesley Publishing Co., section 4.2. Writing
a single regular expression to represent the format of a page
from an information source, as is required in prior systems,
often results in a very large and cumbersome expression, one
which is difficult to write, understand, and maintain. To
solve this problem of existing systems, the reg-exp component
specifies a regular expression by a set of rules for
components of the regular expression. These components are
labeled by nonterminals. However, in contrast to
context-free grammars, the set of rules in the reg-exp
component are not allowed to be recursive or
mutually-recursive. In other words, the rule for a
particular nonterminal cannot directly or indirectly refer to
other rules which refer to that particular nonterminal.

The following exemplifies the use of a set of rules and
nonterminals. A top-level nonterminal defining an
information response can be:

<page> ::= <head> <item>* <tail> END.

which specifies that the response is a page consisting of a
head, followed by zero or more items, followed by a tail.
The keyword "END" denotes the end of a rule. The second-level
nonterminals on the right-hand side ("RHS") of this
rule, <head>, <item>, and <tail>, are defined by their own
rules:

<head> ::= "Results of your search:\n" END;
<item> ::= "Data: ...\n" END;
<tail> ::= "No more results\n" END.

To execute these rules, the reg-exp component compiler
substitutes into the RHS of <page> the RHSs of the rules for
<head>, <item>, and <tail>. The result is as if the wrapper
contained the large, cumbersome composite top-level rule:

<page> ::= "Results of your search:\n" ("Data: ...\n")*
           "No more results\n" END

If the second-level rules had contained further nonterminals
on their RHSs, the compiler would continue making appropriate
substitutions until there are no more nonterminals on the RHS
of the composite rule for the top-level rule. Because of the
lack of recursion or mutual recursion, this substitution
terminates.
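
The substitution just described can be sketched in Python as follows, as an illustration under simplifying assumptions and not as the compiler of the preferred embodiment. Each rule is held as a flat list of symbols, a symbol beginning with "<" is taken to be a nonterminal reference (optionally followed by "*"), and anything else is copied through as a literal. The names RULES and expand are hypothetical. The check against the chain of rules currently being expanded shows why forbidding recursion guarantees termination.

```python
RULES = {
    "<page>": ["<head>", "<item>*", "<tail>"],
    "<head>": ['"Results of your search:\\n"'],
    "<item>": ['"Data: ...\\n"'],
    "<tail>": ['"No more results\\n"'],
}


def expand(name, rules, active=()):
    """Return the RHS of `name` with every nonterminal substituted away.

    `active` holds the rules currently being expanded; meeting one of them
    again means the rule set is (mutually) recursive, which is rejected.
    """
    if name in active:
        raise ValueError("recursive rule: " + name)
    pieces = []
    for symbol in rules[name]:
        if not symbol.startswith("<"):
            pieces.append(symbol)              # literal: copy through
            continue
        starred = symbol.endswith("*")
        ref = symbol[:-1] if starred else symbol
        body = expand(ref, rules, active + (name,))
        pieces.append("(" + body + ")*" if starred else body)
    return " ".join(pieces)


composite = expand("<page>", RULES)
```

Here composite holds the text of the composite top-level rule shown above, without the END keyword.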
A second novel facility permits skipping groups of
characters in a string without backtracking. In prior
regular expression matching systems, this quite common
requirement is implemented in a manner that requires many
backtracking points to be kept on a stack and can lead
to extensive backtracking. For example, the prior art Perl
idiom ".*", which matches any number of occurrences of any
character, or the Perl idiom that matches any
non-digit character, up to the first occurrence of a digit,
can cause extensive backtracking during a match. This is
inefficient and not preferable, since information responses
must be rapidly parsed. See, e.g., Schwartz, 1993, Learning
Perl, O'Reilly & Associates, Inc. The reg-exp component
instead provides a direct syntax for this common idiom:

stuff "literal-string"

where "stuff" is a reserved word. Stuff is defined to match
all characters from the current character, up to but not
including, the first occurrence of the string
"literal-string," which must be a string literal. This
construct allows a compact, efficient, backtrack-free
implementation of this common and important idiom.
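
As a rough illustration of why this construct needs no backtracking stack, the following Python sketch treats the pair stuff "literal-string" as a single forward search for the literal. The function name match_stuff and its return convention are hypothetical; the preferred embodiment compiles the construct into a finite state automaton, whereas this sketch simply relies on the standard str.find method.

```python
def match_stuff(text, pos, literal):
    """Match `stuff "literal"` starting at index `pos`.

    Returns (skipped, new_pos), where `skipped` is the text matched by
    stuff (everything up to, but not including, the first occurrence of
    `literal`) and `new_pos` is the index just past the literal.
    Returns None when the literal does not occur, i.e. the match fails.
    """
    hit = text.find(literal, pos)      # one forward scan, nothing to undo
    if hit < 0:
        return None
    return text[pos:hit], hit + len(literal)


# Example: the field between "Data:" and the newline of one item.
field, after = match_stuff("Data: 42\n", len("Data:"), "\n")
```

Because the literal is fixed, there is only one place where stuff can end, so no alternative end points ever need to be remembered.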

A third novel facility allows relevant data fields to be
extracted from arbitrarily complex regular expressions.
Action language statements can be embedded in the definition
of a nonterminal for execution whenever that nonterminal is
matched and with the variable bindings current at each
occurrence of a match of that nonterminal. In the case of
the previous example, the definition for <item> can be
extended as follows:

<item> ::= "Data:" (stuff) "\n" { output($0); } END.

Whenever <item> is matched, $0 is set to the string matched
by the stuff in the parentheses, and then the output statement
is executed. In this case, $0 gets bound to whatever characters
followed "Data:" and preceded the newline character.

Prior systems, for example AWK and Perl, do not have
such a facility. Although they do set variables, such as $0,
$1, etc., they do so only after matching the single composite
regular expression defining <page>. In other words, the
entire page is matched before variables are set. Therefore,
in AWK and Perl, if there is more than one <item> on the
<page>, the relevant data fields in all but the last <item>
are lost. The reg-exp component of the WDL solves this
problem by allowing specification of actions that are
executed multiple times, once for each nonterminal match,
with different variable bindings each time.
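
The difference can be seen in a small Python sketch, given here only as an analogy. Python's re module behaves like the prior systems described above in that a capture group inside a repetition retains only its last match. The hand-written parse_page function is a hypothetical stand-in for the <page> and <item> rules; it records one field per item and calls the supplied output function once for each. Deferring the calls until the whole page has matched is an assumption of this sketch.

```python
import re

PAGE = ("Results of your search:\n"
        "Data: alpha\nData: beta\nData: gamma\n"
        "No more results\n")

# Single composite expression: the group sits inside a repetition.
composite = re.compile(
    r"Results of your search:\n(?:Data:(.*)\n)*No more results\n")
last_only = composite.fullmatch(PAGE).group(1)   # only the final item


def parse_page(text, output):
    """Match head, zero or more items, then tail; return 1 or 0.

    `output` is called once per matched item with that item's own field.
    """
    head, tail = "Results of your search:\n", "No more results\n"
    if not text.startswith(head):
        return 0
    pos = len(head)
    pending = []
    while text.startswith("Data:", pos):
        end = text.find("\n", pos)
        if end < 0:
            return 0
        pending.append(text[pos + len("Data:"):end])   # binding of $0
        pos = end + 1
    if text[pos:] != tail:
        return 0
    for field in pending:
        output(field)
    return 1


tuples = []
status = parse_page(PAGE, tuples.append)
```

After this runs, last_only holds just the field of the third item, while tuples holds one entry for each of the three items.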

Turning now to the description of the reg-exp language,
it includes certain preferred base-level features: definition
of nonterminal regular expressions; embedding action language
statements in regular expression rules; operators for
expressing alternation and repetition; and literal string
match. It will be apparent to those of skill in the art that
a preferred base-level reg-exp language of equivalent
semantic expressive abilities can be constructed with a
slightly different choice of features, and the reg-exp
language component of the WDL includes such well-known
equivalents. In particular, the reg-exp language comprehends
variable renamings and known grammatical transformations
applied to the rules below.

In optional embodiments, the preferred base features can
be augmented with such additional features as: a special
disjunction; repetition for an arbitrary integer;
matching to character classes; local string memory; and
anchoring characters. Such additional features are standard
with the exception of the special disjunction, which causes
the listed alternatives to be matched simultaneously as the
string is parsed, the first alternative to match being
returned. In contrast, the regular disjunction matches each
alternative in the sequence listed, backtracking until
success is found. The additional features can be added to
the base-level language in known manners. See, e.g.,
Schwartz, 1993, Learning Perl, O'Reilly & Associates, Inc.;
Aho et al., 1986, Compilers: Principles, Techniques, and
Tools, Addison Wesley Publishing Co.; Hopcroft et al.,
1979, Introduction to Automata Theory, Languages, and
Computation, Addison Wesley Publishing Co.

In a further alternative embodiment, declarations could
be added to control backtracking in order to improve
performance of the regular expression matching. If a certain
portion of a rule were known to require no backtracking, that
portion can be bracketed with declarations instructing the
compiler and interpreter to generate a finite state machine
without any provision for backtracking. This permits that
portion of the rule to be matched more efficiently than
otherwise.

The syntax of the preferred base-level reg-exp language
is given by the following grammar expressed in a standard
notation.

Rule       ::= nonterminal Regexp "END"

Regexp     ::= Sequence ( "|" Sequence )*

Sequence   ::= Repetition*

Repetition ::= Term "?"
             | Term "*"
             | Term "+"
             | Term

Term       ::= String_In_Double_Quotes
             | "stuff"
             | "(" Regexp ")"
             | <nonterminal>
             | Action

Action     ::= "{" Statement* "}"

Briefly, a rule specifies a particular <nonterminal> to
recognize a particular regular expression. A regular
expression can include: disjunctions ("|"); sequences; zero
or more repetitions ("*"); one or more repetitions ("+"); and
zero or one repetition ("?"). Terms can be: literal strings
enclosed by double quotation marks; the special symbol stuff;
parenthesized regular expressions; nonterminals, which must
be defined by another rule; or statements written in the
action language.

The Wrapper Description Language

A complete wrapper description can be:

WrapperDescription ::= Statement Rule*

Thus the preferred WDL entities include a statement in the
action language, which is typically a sequence of action
language statements, followed by an optional set of rules in
the reg-exp language, which define a regular expression for
matching a response returned from information sources.

To execute a complete wrapper description, the statement
is executed as is described subsequently. Typically, the
statement contains calls to the built-in functions fetch()
and parse(), among others. The parse() built-in function
attempts to match a response returned by fetch() by invoking
a nonterminal defined by the appended rules, each of which
defines a regular expression. If the regular expression
match is successful, all the action language statements
embedded in the regular expression are executed, with
action statements embedded in lower level nonterminals
typically being executed multiple times with the operand
bindings current at each occurrence of a match, and the
parse() function returns the value "1". If the match is
unsuccessful, none of the embedded statements are executed,
and the parse() function returns the value "0".

5.5.2. IMPLEMENTATION OF THE WDL COMPONENTS
The preferred implementation of the WDL is described in
this section under the following headings: (1) parsing
regular expression rules, (2) intermediate code generation
for regular expressions, (3) run-time interpretation of
regular expressions, (4) action language code generation and
run-time interpretation. It is understood that this
preferred implementation of the WDL is exemplary. It is
known to those of skill in the art that alternatives exist to
the implementation described herein. For example, the
processes described can be implemented differently, e.g.,
with different variables and different orderings of the
individual steps. Also, the processes may use
alternative algorithms to achieve the same effects for, e.g.,
string matching. For a further example, the described
implementation discloses processes of interpreting various
intermediate codes. Different intermediate codes can be
used. For the regular expression language, different types
of nodes are possible including, e.g., nodes having a
variable list of successor states to avoid having to maintain
a backtracking stack of successor states. For the action
language, instead of the disclosed address-based intermediate
code, equivalently a reverse Polish stack-based intermediate
code could be used. Finally, instead of interpretation, it
is possible to compile directly to a machine language using
the disclosed syntax-directed methods.

In the remainder of this section, a parse tree node that
has label "l", contains data "d", and has child nodes "c_1,
..., c_n" is denoted by "l<d; c_1, ..., c_n>". These parse tree
nodes can be constructed and referenced in ways known in the
art, e.g., by a pointer to a data area containing node data.

Parsing Regular Expression Rules

The first step in the preferred implementation of
compiling a wrapper description is parsing the rules of the
reg-exp language and the statements of the action language.
The input to this step are the rules of the reg-exp language
defining the regular expression to match. The output from
this step is intermediate code in the form of a set of parse
trees, one parse tree for the top-level rule and additional
parse trees for each lower-level rule.

In a preferred embodiment, the process of this step is
performed by parsing according to a recursive descent parser
and emitting the parse tree according to a syntax-directed
translator. Construction of a recursive descent compiler for
the previously described syntax of the reg-exp language is
well known to those of skill in the art. For example, it is
clearly disclosed with examples in such textbooks as Aho et
al., 1986, Compilers: Principles, Techniques, and Tools,
Addison Wesley Publishing Co. at sections 2.4 and 4.4. Syntax
directed construction of parse trees is covered with examples
in section 5.2. This invention is not limited to this
preferred embodiment. Alternative parsing techniques are
known in the art, and this invention comprehends embodiments
using such techniques, such as LL or LR parsing. See
generally Aho et al. at chapter 4. This invention also
comprehends alternative techniques of intermediate code
generation known in the art. See generally Aho et al. at
chapter 5.

The specification for the syntax-directed analysis
includes rules for the parse-tree nodes to be created when
each syntax rule is recognized by the parser. For each of
the previous reg-exp syntax rules the following nodes,
labeled by the nonterminal of the rule, are created:

1. Rule: Create the node "Rule <nonterminal-name;
node-for-regexp>." This is the node for an entire rule
labeled with its nonterminal name.

2. Regexp: Create the node "Alternatives
<node_for_sequence_1, ..., node_for_sequence_n>." This is
the node for a disjunction ("|") of alternative regular
expressions, which are tested for a match in the order
listed, the first successful match being returned.

3. Sequence: Create the node "Sequence
<node_for_repetition_1, ..., node_for_repetition_n>." This is
the node for a sequence of regular expression patterns each
of which must sequentially match.

4. Repetition: Create the node "Repetition <type;
node_for_term>." This is the node for a repetition, where
type is one of "?" (0 or 1 repetition), "+" (1 or more
repetitions), or "*" (0 or more repetitions).

5. String_In_Double_Quotes: Create the node "String
<literal_string>." This is the node to match a literal
string.

6. stuff: Create the node "Stuff <>" for the reserved word
"stuff."

7. (Regexp): Create the node "Sequence
<OpenParenthesis<i>, node_for_Regexp, CloseParenthesis<i>>."
This is the node which causes assignment to a sequentially
numbered variable on a match by "Regexp." The node
"OpenParenthesis<n>" is a node representing the n'th open
parenthesis that is encountered in sequentially parsing the
RHS of the current rule. The node "CloseParenthesis<n>" is a
node representing the corresponding n'th close parenthesis
encountered.

8. <nonterminal>: Create the node "Nonterminal
<nonterminal_name>" for an instance of a nonterminal.

9. Action: Create the node "Action
<node_for_action_language_statement>." This is a node for
an instance of an action language statement in a rule. The
node_for_action_language_statement represents intermediate
code for the action language statement.

These rules, which are executed when the parser
recognizes the corresponding reg-exp language syntax rule,
cause the generation of a parse tree, which is formed from
the listed node types linked in a tree structure. This parse
tree is input to subsequent steps of this process.



Code generation in the preferred embodiment includes the
following three steps:

1. special preprocessing of stuff nodes;
2. eliminating occurrences of nonterminals in the RHS
   of the top-level nonterminal; and
3. converting the resulting RHS of the top-level rule
   into a finite state automaton ("FSA").

These steps are, first, generally described, and second,
described in detail.

First, preprocessing of stuff nodes is necessary to
later use the preferred postorder traversal method of FSA
generation. As an FSA cannot be built for a stuff element
alone, it is necessary to build an FSA for the combination of
a stuff node and the literal string node that must follow
each stuff element. Only open and close parentheses can
intervene between a stuff node and its following literal
string. To do so, the parse tree is traversed and each stuff
node is merged with the following literal string by replacing
both nodes with a new Stuff_And_String node. The new node
also accounts for any intervening parentheses.

The second step in code generation substitutes all
nonterminals in the RHS of the top-level nonterminal with
their respective RHSs to generate a single composite rule for
the overall regular expression. During this substitution, it
is necessary to renumber occurrences of variables in action
language statements.

For example, consider the wrapper

<page> ::= ("dave") ("bill") <a> ("dan") {
           output($2); } END
<a> ::= "oren" (stuff "cody") { output($0); } END

Here, $2 in the output statement in <page> references
("dan"), while $0 in the output statement in <a> references
(stuff "cody"). After substitution, if variables in the
output statement were not renumbered, the RHS of <page> would
be expanded as follows:

<page> ::= ("dave") ("bill") "oren" (stuff "cody") {
           output($0); } ("dan") { output($2); }

Here, however, $0 refers to ("dave") instead of (stuff
"cody"), while $2 refers to (stuff "cody") instead of
("dan"). To maintain correct variable assignment, the
variable references must be renumbered as follows:

<page> ::= ("dave") ("bill") "oren" (stuff "cody") {
           output($2); } ("dan") { output($3); }

The third step in the preferred embodiment of the code
generation process converts the processed and expanded RHS
into an FSA by performing a postorder traversal of the parse
tree representing the RHS of the top-level rule, during which
each node is in turn converted into an FSA. The FSAs have
several types of states, most importantly:

1. Charbranch: Branches to one of 256 possible successor
   states depending on the current character and advances the
   input pointer;
2. Action: Executes a block of action language statements;
3. Marker: Records the current input string position, plus
   or minus a certain offset, on a mark stack;
4. Success: When reached, parsing has succeeded;
5. Fail: When reached, the current attempt at parsing has
   failed, and the FSA must either backtrack to the prior
   backtracking point or fail if no backtracking points
   remain;
6. Push: Pushes a configuration, comprising an FSA state and
   the current input string position, onto the backtracking
   stack for later backtracking.



The preferred pre-processing of stuff nodes begins by
expanding or flattening any nested sequences, i.e.,
Sequence elements nested within other Sequence elements.
This is done according to the following process:

FOR each nonterminal PERFORM a POSTORDER traversal of its
        parse tree
    At each parse tree node v
        IF (v = Sequence<c_1, ..., c_n>) THEN
            WHILE (there is an i, 1<=i<=n, such that
                   c_i = Sequence<d_1, ..., d_m>) DO
                Replace node v with a new node
                    Sequence<c_1, ..., c_i-1, d_1, ...,
                             d_m, c_i+1, ..., c_n>

Next, each occurrence of a stuff node is replaced by a
Stuff_And_String node that also accounts for the following
literal string and any intervening parentheses. Every
occurrence of "stuff" in a wrapper must be followed by a
literal string with the semantic result that "stuff" matches
all characters from the current position up to, but not
including, the first occurrence of the following literal
string. Only open and close parentheses can intervene
between stuff and the literal string. In the preferred
embodiment, this replacement is performed
according to the following process:

FOR each nonterminal PERFORM a POSTORDER traversal of its
        parse tree.
    At each node v of the parse tree
        IF (v = Sequence<c_1, ..., c_n>) THEN
            WHILE (there is an i, 1<=i<=n, such that
                   c_i = Stuff<>) DO
                Let j be the smallest j>i such that
                    c_j = String<s> for some string s.
                IF there is no such j THEN signal an
                    error.
                ELSE IF (any element c_k (where i<k<j) is
                         NOT either an OpenParenthesis or
                         CloseParenthesis) THEN signal an
                    error
                /* c_j is node for the string */
                ELSE replace node v with a new node
                    Sequence<c_1, ..., c_i-1,
                             Stuff_And_String<c_j; c_i+1, ...,
                             c_j-1>, c_j+1, ..., c_n>

This process merges the stuff node, the literal string node,
c_j, and any intervening parentheses nodes, c_i+1, ..., c_j-1,
into the single node Stuff_And_String<the-string; c_i+1, ...,
c_j-1>.
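
The two pre-processing passes can be sketched in Python as follows. This is a simplified illustration, not the preferred implementation: nodes are represented as plain tuples whose first element is the node label, the labels "Open" and "Close" stand for OpenParenthesis and CloseParenthesis, and both functions operate on the child list of a single Sequence node and do not traverse a whole parse tree. The function names flatten and merge_stuff are hypothetical.

```python
def flatten(children):
    """Splice the children of nested Sequence nodes into their parent."""
    flat = []
    for child in children:
        if child[0] == "Sequence":
            flat.extend(flatten(child[1]))
        else:
            flat.append(child)
    return flat


def merge_stuff(children):
    """Replace each Stuff node, the literal String that follows it, and
    any parentheses in between by one StuffAndString node."""
    merged, i = [], 0
    while i < len(children):
        node = children[i]
        if node[0] != "Stuff":
            merged.append(node)
            i += 1
            continue
        j = i + 1
        while j < len(children) and children[j][0] in ("Open", "Close"):
            j += 1                      # only parentheses may intervene
        if j == len(children) or children[j][0] != "String":
            raise ValueError("stuff must be followed by a literal string")
        merged.append(("StuffAndString", children[j][1], children[i + 1:j]))
        i = j + 1
    return merged


# RHS of: <item> ::= "Data:" (stuff) "\n"
item_rhs = [
    ("String", "Data:"),
    ("Sequence", [("Open", 0), ("Stuff",), ("Close", 0)]),
    ("String", "\n"),
]
processed = merge_stuff(flatten(item_rhs))
```

Flattening must come first in this sketch, because the literal string that terminates the stuff sits outside the parenthesized sub-sequence until the nesting is removed.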
Code Generation: Substitution and Variable Renumbering

Having completed pre-processing, nonterminals on the RHS
of the top-level rule are preferably substituted and any
action statement variables renumbered. Substitution and
variable renumbering preferably use a process having two
functions, EXPAND_REGEXP and EXPAND_ACTION, and a global
variable PAREN_COUNT. EXPAND_REGEXP is called recursively to
substitute for RHS nonterminals, starting with a call to the
top-level nonterminal. PAREN_COUNT is initialized to 0 and,
as substitution proceeds, it is incremented for each "("
encountered. PAREN_COUNT thus has a count of the total
number of "(" encountered so far. Then during substitution,
parentheses encountered are renumbered from their prior
number to the current value of PAREN_COUNT and variables in
action statements are renumbered in a corresponding fashion.
For example, if PAREN_COUNT is currently 8, an
OpenParenthesis<2> node is replaced by an OpenParenthesis<8>
node, and an output($2) statement is replaced by an
output($8) statement. EXPAND_ACTION performs action
statement variable renumbering. These processes are
preferably performed according to the following:

GLOBAL VARIABLE integer PAREN_COUNT;

FUNCTION EXPAND_REGEXP (<nt> : nonterminal) : RETURNS a
        parse tree with renumbered variables
    /* an array of integers with new variable numbers */
    LOCAL VARIABLE new_names[]
    LET T = parse tree for the RHS of rule for <nt>
    PERFORM a PREORDER traversal of T.
    At each node v of tree T
        IF (v = OpenParenthesis<i>) THEN
            new_names[i] = PAREN_COUNT;
            REPLACE node v with
                OpenParenthesis<PAREN_COUNT>
            PAREN_COUNT = PAREN_COUNT + 1;
        ELSE IF (v = CloseParenthesis<i>) THEN
            REPLACE node v with
                CloseParenthesis<new_names[i]>
        ELSE IF (v = Nonterminal<x>) THEN
            REPLACE node v with EXPAND_REGEXP(x)
        ELSE IF (v = Action<x>) THEN
            REPLACE node v with EXPAND_ACTION(x,
                new_names[])

FUNCTION EXPAND_ACTION (T : parse tree for action
        language statement; new_names[]) : RETURNS a parse
        tree with renumbered variables
    PERFORM a PREORDER traversal of the parse tree T.
    At each node v,
        IF (v denotes variable name $i (i an integer))
        THEN
            REPLACE node v with a new node with $i
            replaced by $(new_names[i]).
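
The renumbering process can be exercised on the <page> and <a> example given earlier with the following Python sketch. It is a simplification offered for illustration: each RHS is a flat list of tuples and not a parse tree, an Action node carries only the list of variable numbers it references, and a one-element list named counter takes the place of the global variable PAREN_COUNT. The names RULES, expand_regexp, and counter are hypothetical.

```python
RULES = {
    "page": [
        ("Open", 0), ("String", "dave"), ("Close", 0),
        ("Open", 1), ("String", "bill"), ("Close", 1),
        ("Nonterminal", "a"),
        ("Open", 2), ("String", "dan"), ("Close", 2),
        ("Action", [2]),                      # output($2)
    ],
    "a": [
        ("String", "oren"),
        ("Open", 0), ("Stuff",), ("String", "cody"), ("Close", 0),
        ("Action", [0]),                      # output($0)
    ],
}


def expand_regexp(name, rules, counter):
    """Substitute nonterminals in the RHS of `name` and renumber
    parentheses and action variables. `counter[0]` plays PAREN_COUNT."""
    new_names = {}                            # local to this rule
    expanded = []
    for node in rules[name]:
        kind = node[0]
        if kind == "Open":
            new_names[node[1]] = counter[0]
            expanded.append(("Open", counter[0]))
            counter[0] += 1
        elif kind == "Close":
            expanded.append(("Close", new_names[node[1]]))
        elif kind == "Nonterminal":
            expanded.extend(expand_regexp(node[1], rules, counter))
        elif kind == "Action":
            # Corresponds to EXPAND_ACTION: rename each $i.
            expanded.append(("Action", [new_names[i] for i in node[1]]))
        else:
            expanded.append(node)
    return expanded


composite = expand_regexp("page", RULES, [0])
actions = [node[1] for node in composite if node[0] == "Action"]
opens = [node[1] for node in composite if node[0] == "Open"]
```

In this run the action taken from <a> ends up referring to variable 2 and the action of <page> to variable 3, in agreement with the renumbered rule shown in the example.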

Code Generation Part: Creating A Finite State Automaton

The final step in the preferred embodiment of code
generation creates an FSA representing the substituted and
processed regular expression. Although algorithms for
creating such FSAs are known, they have not in the past
provided facilities, for example, for embedding action
language statements in such FSAs. Accordingly, the following
description focuses on those new features of the preferred
embodiment of this process that are directed to supporting
action language statements and the other novel features of
the reg-exp language.

An FSA output from this process starts its execution in
an initial state with an input pointer set to the start of
the input string and then repeatedly executes one of six
basic procedures according to its current state. These
procedures set a new current state and typically affect one
or more of the data structures accessed by the FSA. The six
types of states and their procedures are as follows:

(1) Char_branch<next_state_0, ..., next_state_255>:
When the current state is a char_branch state, the next state
is selected according to the ASCII value, i, of the current
input character as next_state_i, and the input pointer is
advanced to the next character in the input string.

(2) Marker<num, open/close_flag, offset, next_state>:
When the current state is a marker state, a record is pushed
on a run-time stack known as the mark stack that contains the
character position of a parenthesis encountered in the
string plus the indicated offset and an indication of whether
this parenthesis is open or close, and the next state is set
to next_state. An exemplary mark can push a record
indicating that the fifth open parenthesis occurred at the
current input position.

(3) Action<compiled_action_statements, next_state>:
When the current state is an action state, those action
language statements represented by compiled_action_statements
are executed and the next state is set to next_state.

(4) Success<>: When the current state is the success
state, a string has been successfully matched and this FSA
terminates.

(5) Push<state, next_state>: When the current state is
a push state, the current configuration of the FSA is pushed
onto a run-time stack known as the backtracking stack, and
the next state is set to next_state. An FSA configuration is
a record containing identification of at least the current
state together with the current position in the input string.
Push is used to support the non-deterministic constructs in
the regular expression language. An exemplary non-deterministic
construct is the disjunction "reg-expr1 | reg-expr2," which is
matched by an FSA that, first, pushes a backtracking state
onto the backtracking stack; second, attempts to run the FSA
corresponding to reg-expr1; and third, if that fails,
backtracks and attempts to run the finite state machine
corresponding to reg-expr2.

(6) Failure<>: When the current state is the failure
state, a backtrack record is popped off the backtracking
stack and the next state and current input pointer are set to
the contents of the backtrack record. If the backtracking
stack is empty, the FSA has failed to match the input string
and terminates.
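
To make the interplay of the push and failure states concrete, the
following Python sketch models only the char_branch, push, success and
failure states and runs the disjunction "ab | ac". It rests on several
assumptions that are not part of the disclosure: the 256-entry branch
table is stored as a sparse dictionary whose missing entries mean
Failure<>, the end of the input is treated as a failed branch, and the
helper names literal and matches are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Success:
    pass


@dataclass
class Failure:
    pass


@dataclass
class CharBranch:
    table: dict  # character code -> next state; missing codes fail


@dataclass
class Push:
    state: object       # alternative resumed after a failure
    next_state: object  # state tried first


def literal(text, then):
    """Chain of char_branch states matching `text`, ending in `then`."""
    state = then
    for ch in reversed(text):
        state = CharBranch({ord(ch): state})
    return state


def matches(start, text):
    """Run the machine; True if some alternative reaches Success."""
    state, pos, bt_stack = start, 0, []
    while True:
        if isinstance(state, Success):
            return True
        if isinstance(state, CharBranch):
            code = ord(text[pos]) if pos < len(text) else None
            state = state.table.get(code, Failure())
            pos += 1
        elif isinstance(state, Push):
            bt_stack.append((state.state, pos))  # save the alternative
            state = state.next_state
        else:  # Failure: backtrack, or give up if nothing is saved
            if not bt_stack:
                return False
            state, pos = bt_stack.pop()


# "ab | ac": try "ab" first, fall back to "ac" from the same position.
machine = Push(state=literal("ac", Success()),
               next_state=literal("ab", Success()))
```

With this machine, matches(machine, "ac") fails on the first
alternative at the second character, restores input position 0 from
the backtracking stack, and then succeeds on the second alternative.
Because the machine stops at Success<>, trailing input is not examined
in this sketch.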

Creation of the FSA for an input regular expression is
done using a postorder traversal of the previously produced
parse tree. As the parse tree is traversed, to each node v
is attached a finite state machine that matches the
sub-expression represented by the sub-tree rooted at v. This
attached finite state machine is the value of the variable
v.machine. The final FSA output is the FSA that is attached
to the root of the parse tree when the creation process
completes.

During the traversal, certain finite state machines are
created in accordance with methods already known to those of
skill in the art. These machines recognize standard regular
expression constructs and can be constructed by known
methods. General references for the creation of such
standard finite state machines include the following: Aho et
al., 1974, The Design and Analysis of Computer Algorithms,
Addison Wesley Publishing Co., chapter 9; Aho et al., 1986,
Compilers: Principles, Techniques, and Tools, Addison Wesley
Publishing Co., chapter 3; Hopcroft et al., 1979,
Introduction to Automata Theory, Languages, and Computation,
Addison-Wesley Publishing Co., Section 2.5; Sedgewick, 1990,
Algorithms in C, Addison-Wesley Publishing Co., chapter 16.

Further, the process disclosed adopts certain
construction methods that can advantageously be changed in
alternative embodiments. For example, the machine
representing a literal string can be constructed according to
the Boyer-Moore or other string matching algorithm. See,
e.g., Sedgewick at chapter 19.

The preferred process is:

PERFORM a POSTORDER traversal of the tree.
At each node v, DO:
    IF (v=OpenParenthesis<n>) THEN
        SET v.machine = Marker<n, "open", 0, Success<>>
    ELSE IF (v=CloseParenthesis<n>) THEN
        SET v.machine = Marker<n, "close", 0, Success<>>
    ELSE IF
      (v=Action<node_for_the_action_language_statements>)
    THEN
        SET v.machine =
            Action<compiled_action_statements, Success<>>
    ELSE IF (v=String<literal_string>) THEN
        SET len = length(the-string)
        FOR i = 1 to len DO
            branch_state[i] = Char_branch<s_0, s_1,
                ..., s_255>, where all the s_j's are
                Failure<> EXCEPT the s_j where j is
                the ASCII code for the i'th
                character of the-string, which is
                Success<>
        SET v.machine = MAKE_SEQUENCE
            (branch_state[1], ...,
            branch_state[len]).
    ELSE IF
      (v=stuffed_string<the-string, par_1, ..., par_k>)
    THEN
        /* m_1 can be constructed according to known
           methods disclosed in the previously cited
           references */
        SET m_1 = a finite state machine that scans
            the input string for the first occurrence of
            the-string and stops as soon as it comes to
            the end of this first occurrence.
        FOR i = 1 to k DO
            SET par_i.machine = marker state
                appropriate to parenthesis par_i and
                having an offset of
                -length(the-string).
        SET v.machine = MAKE_SEQUENCE (m_1,
            par_1.machine, ..., par_k.machine).
    ELSE IF (v=Sequence<element_1, ..., element_k>)
    THEN
        FOR i = 1 to k DO
            SET element_i.machine = machine
                constructed according to this
                process for element_i
        SET v.machine = MAKE_SEQUENCE
            (element_1.machine, ...,
            element_k.machine)
    /* a disjunction, "|" */
    ELSE IF (v=Alternatives<element_1, ..., element_k>)
    THEN
        FOR i = 1 to k DO
            SET element_i.machine = machine
                constructed according to this
                process for element_i
        SET v.machine = element_k.machine
        FOR i = k-1 downto 1 DO
            v.machine = Push<v.machine,
                element_i.machine>
    /* type is "?", "+", or "*" */
    ELSE IF (v=Repetition<type, element>) THEN
        SET element.machine = machine constructed
            according to this process for element
        SET v.machine = a finite state machine
            constructed according to methods
            disclosed in the previously cited
            references for this type of repetition of
            element.machine

This process uses a MAKE_SEQUENCE function that builds a
composite machine from a series of sub-machines. The
composite machine executes each of the sub-machines in
sequence until one sub-machine fails, in which case the
composite machine also fails, or all the sub-machines
succeed, in which case the composite machine succeeds. In
other words, the composite machine runs the first
sub-machine; if that succeeds, it runs the second; if that
succeeds, it runs the third, and so on. If any sub-machine
fails, i.e., reaches a Failure<> state, then the composite
machine also fails.

FUNCTION MAKE_SEQUENCE (machine m_1, m_2, ..., m_k)
        : RETURNS machine
    new_m = m_1;
    FOR i = 2 to k DO
        Traverse the states of new_m and replace
            any Success<> states with m_i
    RETURN new_m
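
A possible rendering of MAKE_SEQUENCE in Python is sketched below. The
State class, with a kind tag and a dictionary of labelled successor
states, is a hypothetical representation chosen for brevity and is not
taken from the disclosure. As in the pseudocode, the first machine is
modified in place, and a visited set guards against cycles introduced
by repetition constructs.

```python
class State:
    """One FSA state: a kind tag plus labelled successor states."""

    def __init__(self, kind, succ=None):
        self.kind = kind  # e.g. "char", "marker", "success", "failure"
        self.succ = succ if succ is not None else {}


def make_sequence(*machines):
    """Chain machines so each Success of one leads into the next."""
    new_m = machines[0]
    for m_i in machines[1:]:
        if new_m.kind == "success":
            # A machine that is only a Success state is replaced whole.
            new_m = m_i
            continue
        seen, todo = set(), [new_m]
        while todo:
            s = todo.pop()
            if id(s) in seen:
                continue
            seen.add(id(s))
            for label, nxt in list(s.succ.items()):
                if nxt.kind == "success":
                    s.succ[label] = m_i  # splice in the next machine
                else:
                    todo.append(nxt)
    return new_m


# Three single-character machines chained into one matching "abc".
a = State("char", {ord("a"): State("success")})
b = State("char", {ord("b"): State("success")})
c = State("char", {ord("c"): State("success")})
abc = make_sequence(a, b, c)
```

The traversal deliberately does not descend into the machine it has
just spliced in, so that machine's own Success<> states remain
available for the following iteration.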

This concludes the preferred construction of FSAs
representing regular expressions from the reg-exp language of
this invention. Next, the FSA execution is described.

RUN-TIME INTERPRETATION OF REGULAR EXPRESSIONS

In a preferred embodiment, the FSA representing a
regular expression is executed by interpreting the structure
created in the previous code generation steps. The
interpreter preferably accesses several data structures
including the current state of the FSA, a pointer to the
current character position of the input string, a
backtracking stack and a marker stack. The backtracking
stack contains records characterizing the state of
interpretation of the FSA and is used in a manner known in
the art to implement backtracking, in case an attempted
partial match of the input string fails. Configuration
records include the current state and the current input
position. In an alternative embodiment, push states can be
avoided by implementing states with a list of possible next
states.

The preferred implementation of the novel variable
binding and action statement features of the reg-exp language
of this invention utilizes an additional stack, the mark
stack. The semantics of these features require that no
actions be executed until the entire regular expression
matching process succeeds. Upon success, all actions are
then executed with the variable bindings that occurred during
parsing. This means that a single variable, e.g. $i, can
have different bindings in the same action statement which
can be executed multiple times on match success.

This semantics is implemented in the preferred
embodiment using the mark stack in the following manner.
When the current interpreted state is an Action<> state, the
action language statement is not immediately executed.
Rather, the action statement code is pushed onto the mark
stack. Similarly, when the current state is a mark state,
which represents a parenthesis in the regular expression
definition, the information present in the mark state is
pushed onto the mark stack. This information permits finding
the position in the input string of the beginning or ending
of a current variable binding. To prevent execution of
action language statements encountered during a failed
attempted partial match, the mark stack is popped to an
appropriate level when the FSA backtracks. Thus, the machine
configuration and the configuration record on the
backtracking stack further include the current position in
the mark stack.

Action language statements are executed upon final match
success by scanning the mark stack from bottom to top. In
this scan, variable bindings are set as indicated by mark
state records and actions are executed as indicated by action
state records. In case of match failure, no action language
statements are executed.

In more detail, an embodiment of the preferred
implementation functions as follows:

GLOBAL VARIABLE current_state = initialized to the
    initial machine state
GLOBAL VARIABLE input_pos = initialized to the address
    of the beginning of the input string
GLOBAL VARIABLE BT_STACK = initialized to an empty stack
GLOBAL VARIABLE MARK_STACK = initialized to an empty
    stack

REPEAT UNTIL one of the following clauses exits
    CASE (current_state) IS
        Char_branch<state_0, ..., state_255>:
            current_state = state_(ASCII value of the
                current input character)
            input_pos = input_pos + 1
        Marker<num, open/close_flag, offset, nextstate>:
            push ("mark", num, open/close_flag,
                input_pos+offset) onto MARK_STACK
            current_state = nextstate
        Action<compiled_action_code, nextstate>:
            push ("action", compiled_action_code) onto
                MARK_STACK
            current_state = nextstate
        Push<state, nextstate>:
            push (state, input_pos, height(MARK_STACK))
                onto BT_STACK
            current_state = nextstate
        /* exit the REPEAT statement on final success */
        Success<>: GOTO done
        Failure<>:
            /* match finally fails if backtracking is
               attempted and the backtrack stack is empty */
            IF BT_STACK is empty THEN
                RETURN "fail"
            ELSE
                pop BT_STACK record into variables st,
                    ip, msh
                current_state = st
                input_pos = ip
                pop MARK_STACK down to height msh
    END CASE

/* when parsing succeeds, scan MARK_STACK and execute
   actions with variable bindings indicated by the marks */
done:
    LOCAL VARIABLE open_marks[], close_marks[]
    FOR element = bottom of MARK_STACK up to top of
            MARK_STACK DO:
        IF (element=("mark", num, "open", pos)) THEN
            SET open_marks[num] = pos
        IF (element=("mark", num, "close", pos)) THEN
            SET close_marks[num] = pos
        IF (element=("action", compiled_action_code)) THEN
            SET values of $0, $1, ..., $n to the positions
                in the input string indicated by
                open_marks[n] and close_marks[n]
            RUN compiled_action_statement using these
                bindings of $0, $1, ..., $n
    END FOR
    /* match succeeded and all action executed */
    RETURN "succeed"

For example, in the preceding FOR loop, the binding of $2 is
set to the sub-string of the input string beginning at
open_marks[2] and ending at close_marks[2].
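
The deferred execution of actions can be seen in the following Python
sketch of the interpreter loop. It is a simplified illustration under
stated assumptions, not the disclosed implementation: a single
hypothetical State record stands for all six state types, the branch
table is a sparse dictionary, running off the end of the input counts
as a failed branch, and an "action" is an ordinary Python callable
that receives the bound sub-strings.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class State:
    kind: str           # "char", "marker", "action", "push", "success", "failure"
    nxt: object = None  # successor state (branch table dict for "char")
    arg: object = None  # (num, flag, offset), a callable, or the alternative


FAIL = State("failure")


def run(start, text):
    state, pos = start, 0
    bt_stack, mark_stack = [], []
    while state.kind != "success":
        if state.kind == "char":
            code = ord(text[pos]) if pos < len(text) else -1
            state, pos = state.nxt.get(code, FAIL), pos + 1
        elif state.kind == "marker":
            num, flag, offset = state.arg
            mark_stack.append(("mark", num, flag, pos + offset))
            state = state.nxt
        elif state.kind == "action":
            mark_stack.append(("action", state.arg))  # deferred
            state = state.nxt
        elif state.kind == "push":
            bt_stack.append((state.arg, pos, len(mark_stack)))
            state = state.nxt
        else:  # failure: backtrack or report overall failure
            if not bt_stack:
                return "fail"
            state, pos, height = bt_stack.pop()
            del mark_stack[height:]  # discard marks of the failed attempt
    # Final success: replay the mark stack from bottom to top.
    open_marks, close_marks = {}, {}
    for element in mark_stack:
        if element[0] == "mark":
            _, num, flag, where = element
            (open_marks if flag == "open" else close_marks)[num] = where
        else:
            bindings = {n: text[open_marks[n]:close_marks[n]]
                        for n in close_marks if n in open_marks}
            element[1](bindings)
    return "succeed"


def lit(s, then):
    """Chain of char states matching the literal s, ending in `then`."""
    for ch in reversed(s):
        then = State("char", {ord(ch): then})
    return then


log = []


def alt(tag, tail):
    """Builds: ( "a" ) {record tag and $1} tail."""
    body = lit(tail, State("success"))
    body = State("action", body, lambda b, tag=tag: log.append((tag, b[1])))
    body = State("marker", body, (1, "close", 0))
    body = lit("a", body)
    return State("marker", body, (1, "open", 0))


machine = State("push", alt("first", "b"), alt("second", "c"))
result = run(machine, "ac")
```

On the input "ac" the first alternative pushes its marks and its
action before failing at the second character. Backtracking pops them
off the mark stack again, so only the action of the second alternative
runs, with $1 bound to "a".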

ACTION LANGUAGE CODE GENERATION AND RUN-TIME INTERPRETATION
In a preferred embodiment, action language statements
are compiled into a variable length bytecode-type of
intermediate language that is executed at run-time by a
bytecode interpreter. The previously described action
language syntax is parsed by any appropriate known parsing
technique, for example by LL or LR parsing. Adequate parsing
techniques are presented with examples in Aho et al., 1986,
Compilers: Principles, Techniques, and Tools, Addison Wesley
Publishing Co., chapter 4. During parsing all variables
occurring in the wrappers are assigned unique numbers. A
preferred such assignment assigns sequential positive
integers to named variables, e.g., $x, $abc, in the order
they are encountered, and assigns sequential negative
integers to the numeric variables denoting matched
sub-strings, e.g., $0, $1, $2, etc., the numeric variable
$i being assigned a negative integer determined by i.

Intermediate code is generated by known syntax-directed
translation techniques. Adequate syntax-directed translation
techniques are presented with examples in Aho et al. at
chapter 8. The preferred intermediate code is presented
below. In this presentation, intermediate language
instructions have the format of an instruction code, with
optional modifying information, followed by zero or more
arguments. Variables are denoted by <var> where var is the
2-byte integer coding of the variable. Relative branch
offsets are encoded as 4-byte integers. All branches are
relative to the current instruction address, i.e., the
address of the branch instruction itself, so the bytecode is
relocatable.

1. Function Calls:
   Encoding: 0 funcode numargs <var_0> <var_1> ...
        <var_numargs>
   Meaning: var_0 is assigned the result of the function
        identified by "funcode" on the arguments var_1, ...,
        var_numargs of number "numargs." Each built-in
        function, e.g. argc, argv, etc., and each operator
        of the language has its own unique integer
        "funcode".

2. Branch Instructions:
   Encoding: 1 <offset> <var0>
   Meaning: if var0 is true, branch by amount offset
        relative to the branch instruction
   Encoding: 2 <offset> <var0>
   Meaning: if var0 is false, branch by amount offset
        relative to the branch instruction
   Encoding: 3 <offset>
   Meaning: always branch by amount offset relative to the
        branch instruction

3. Load Constant Instructions:
   Encoding: 4 <var0> <null-terminated-string>
   Meaning: var0 is assigned the null-terminated-string
        constant
   Encoding: 5 <var0> <floating-point-constant>
   Meaning: var0 is assigned the floating-point-constant

4. Move Instructions:
   Encoding: 6 <var0> <var1>
   Meaning: var0 is copied from var1

5. Output Instructions:
   Encoding: 7 <var0>
   Meaning: outputs one item var0 to the current tuple
   Encoding: 8
   Meaning: terminates the current tuple, causes it to be
        returned to the netbot

6. Parsing Instructions:
   Encoding: 9 <var0> <var1> <address of the finite state
        machine for <nt>>
   Meaning: var0 is assigned the result of the parse of
        var1 according to the regular expression denoted by
        <nt>

7. Exit Instructions:
   Encoding: 10
   Meaning: exits the current action language block.

In alternative embodiments, different intermediate codes
or even no intermediate code can be used. For example, a
different code could be used instead of the previously
disclosed address-based intermediate code. Optionally, no
intermediate code is used, and action language statements
are compiled directly to machine language. Both these
alternatives can be implemented by using the disclosed
syntax-directed methods. See, e.g., Aho et al.

Execution of wrapper actions is performed in a preferred
embodiment in several steps. First, memory is allocated to
store all variables and variable values. The number of
variables is known after the action statements have been
compiled. All variables are initialized to an unassigned
value. Next, an instruction pointer is initialized to
point to the first instruction of the bytecode to be
executed. Then the interpreter enters a loop which fetches
the bytecode pointed to by the instruction pointer and
performs the coded simple action according to the bytecode
instruction. The instruction pointer is then advanced,






WO 98/12881 



PCT/US97/17132 



or for taken branch instructions, is modified by the offset
value. The interpreter finishes execution upon encountering
an exit instruction.

The parse built-in function is executed at run-time by
calling the FSA interpreter to interpret the compiled regular
expression code specifying the finite state machine.
Output(a_1, ..., a_n) statements are translated and then
executed by executing bytecode instruction code 7 for each
variable in the output list followed by a bytecode
instruction code 8 that terminates the current tuple, i.e.,
marks the end of the output statement.
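
The interpreter loop and the relative branch convention can be
illustrated with the following Python sketch, which handles only
instruction codes 1 to 4, 6, 7, 8 and 10. Several details are
assumptions made for the example and are not stated in the
description: the instruction code occupies one byte, integers are
little-endian and signed, an unassigned variable counts as false, and
returning the finished tuples as a list stands in for returning them
to the netbot.

```python
import struct


def run_bytecode(code):
    """Interpret a subset of the intermediate code; return output tuples."""
    variables = {}  # variable number -> value; absent means unassigned
    tuples, current = [], []
    ip = 0          # instruction pointer (byte address)
    while True:
        op = code[ip]
        if op in (1, 2, 3):  # branch if true / if false / always
            (offset,) = struct.unpack_from("<i", code, ip + 1)
            taken, size = True, 5
            if op != 3:
                (var,) = struct.unpack_from("<h", code, ip + 5)
                taken = bool(variables.get(var)) == (op == 1)
                size = 7
            # Offsets are relative to the branch instruction itself.
            ip += offset if taken else size
        elif op == 4:        # load null-terminated string constant
            (var,) = struct.unpack_from("<h", code, ip + 1)
            end = code.index(b"\0", ip + 3)
            variables[var] = code[ip + 3:end].decode("ascii")
            ip = end + 1
        elif op == 6:        # move: var0 is copied from var1
            dst, src = struct.unpack_from("<hh", code, ip + 1)
            variables[dst] = variables.get(src)
            ip += 5
        elif op == 7:        # output one item to the current tuple
            (var,) = struct.unpack_from("<h", code, ip + 1)
            current.append(variables.get(var))
            ip += 3
        elif op == 8:        # terminate the current tuple
            tuples.append(tuple(current))
            current = []
            ip += 1
        elif op == 10:       # exit
            return tuples
        else:
            raise ValueError(f"unsupported instruction code {op}")


program = b"".join([
    bytes([4]) + struct.pack("<h", 1) + b"hello\0",   # addr 0: var 1 = "hello"
    bytes([2]) + struct.pack("<ih", 11, 2),           # addr 9: if var 2 false, go to 9+11
    bytes([7]) + struct.pack("<h", 1),                # addr 16: skipped
    bytes([8]),                                       # addr 19: skipped
    bytes([6]) + struct.pack("<hh", 2, 1),            # addr 20: var 2 = var 1
    bytes([7]) + struct.pack("<h", 2),                # addr 25: output var 2
    bytes([8]),                                       # addr 28: end of tuple
    bytes([10]),                                      # addr 29: exit
])
output = run_bytecode(program)
```

Because every offset is measured from the address of its own branch
instruction, the same byte sequence could be placed at any address
without patching, which is the relocatability property noted above.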

SPECIFIC EMBODIMENTS, CITATION OF REFERENCES

The present invention is not to be limited in scope by
the specific embodiments described herein. Indeed, various
modifications of the invention in addition to those described
herein will become apparent to those skilled in the art from
the foregoing description and accompanying figures. Such
modifications are intended to fall within the scope of the
appended claims.

Various publications are cited herein, the disclosures
of which are incorporated by reference in their entireties.



WHAT IS CLAIMED IS:

1.  A method for assisting a user to query for information
    available from information sources attached to a
    network, said method comprising:

    a. selecting the one or more information sources most
    relevant to a user query;

    b. formatting said user query for each said relevant
    information source according to a description of each
    said relevant information source written in a wrapper
    description language;

    c. transmitting said formatted query to each of said
    relevant information sources;

    d. extracting data fields relevant to said user query
    from responses returned from said relevant information
    sources, said extracting according to said description
    of said relevant information source returning each said
    response; and

    e. presenting said relevant data fields to said user.

2.  The method of claim 1 further comprising, after said
    extracting step, a step of estimating the relevance of
    said responses to said user query.

3.  The method of claim 1 where said selecting step further
    comprises the steps of:

    a. determining one or more concepts to which said user
    query relates; and

    b. selecting an information source as relevant if said
    information source relates to said concepts.

4.  The method of claim 1 wherein said wrapper description
    language comprises facilities for the specification of
    an entire regular expression in terms of component
    regular expressions such that, upon recognition of one
    component regular expression of said entire regular
    expression, actions can be executed with variables bound
    as of the time of recognition of said component regular
    expression, said actions actually being executed only if
    said entire regular expression is also recognized.

5.  A system for assisting a user to query for information
    available from information sources attached to a
    network, said system comprising:

    a. a first processor means for selecting the one or
    more information sources most relevant to a user query;

    b. a second processor means for formatting said user
    query for each said relevant information source
    according to a description of each said relevant
    information source written in a wrapper description
    language, and for transmitting said formatted query to
    each of said relevant information sources;

    c. a third processor means for extracting data fields
    relevant to said user query from responses returned from
    said relevant information sources, said extracting
    according to said description of said relevant
    information source returning each said response; and

    d. a fourth processor means for presenting said
    relevant data fields to said user.

6.  The system of claim 5 wherein said first, said second,
    said third, and said fourth processor means reside in one
    network attached computer, said computer comprising a
    CPU and memory.

7.  The system of claim 5 wherein said first, said second,
    said third, and said fourth processor means reside in
    separate network attached computers, each said computer
    comprising a CPU and memory.

8.  The system of claim 5 wherein said fourth processor means
    presents said relevant data fields by executing a World
    Wide Web browser.

9.  The system of claim 5 further comprising a first network
    attached memory means for storing the descriptions of
    said information sources and for providing said
    descriptions upon request of said second or third
    processor means.

10. A wrapper description language method comprising
    recognizing a string by scanning said string according
    to a regular expression written in a regular expression
    language, said regular expression language capable of
    specifying an entire regular expression in terms of
    component regular expressions such that, upon
    recognition of one component regular expression of said
    entire regular expression, actions can be executed with
    variables bound as of the time of recognition of said
    component regular expression, said actions actually
    being executed only if said entire regular expression is
    also recognized.

11. The wrapper description language method of claim 10
    further comprising the step of executing one or more
    statements written in an action language, said action
    language capable of specifying the actions of operators
    on variables, the actions of functions on variables, and
    of the sequence of statements, the statements of said
    action language being further capable of being embedded
    in said regular expression language for specifying said
    actions occurring on the recognition of said component
    regular expressions.

12. A computer readable medium containing instructions for
    causing one or more computers to perform the method of
    claim 1.

13. A computer readable medium containing instructions for
    causing one or more computers to perform the method of
    claim 10.